(57) Abstract

A concatenator for a first digital frame with a second digital frame,
such as the ending and beginning of adjacent diphone strings being
concatenated to form speech, is based on determining an optimum blend
point for the first and second digital frames in response to the
magnitudes of samples in the first and second digital frames. The
frames are then blended to generate a digital sequence representing a
concatenation of the first and second frames with reference to the
optimum blend point. The system operates by first computing an extended
frame in response to the first digital frame, and then finding a subset
of the extended frame which matches the second digital frame using a
minimum average magnitude difference function over the samples in the
subset. The blend point is the first sample of the matching subset. To
generate the concatenated waveform, the subset of the extended frame is
combined with the second digital frame and concatenated with the
beginning segments of the extended frame to produce the concatenated
waveform.
WAVEFORM BLENDING TECHNIQUE FOR
TEXT-TO-SPEECH SYSTEM



LIMITED COPYRIGHT WAIVER
A portion of the disclosure of this patent document
contains material to which the claim of copyright protection is made.
The copyright owner has no objection to the facsimile reproduction by 
any person of the patent document or the patent disclosure, as it 
appears in the U.S. Patent and Trademark Office file or records, but 
reserves all other rights whatsoever. 

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to systems for smoothly 
concatenating quasi-periodic waveforms, such as encoded diphone 
records used in translating text in a computer system to synthesized 
speech.

Description of the Related Art

In text-to-speech systems, stored text in a computer is translated 
to synthesized speech. As can be appreciated, this kind of system 
would have widespread application if it were of reasonable cost. For
instance, a text-to-speech system could be used for reviewing electronic 
mail remotely across a telephone line, by causing the computer storing 
the electronic mail to synthesize speech representing the electronic mail. 
Also, such systems could be used for reading to people who are visually 
impaired. In the word processing context, text-to-speech systems might 
be used to assist in proofreading a large document.

However, in prior art systems which have reasonable cost, the
quality of the speech has been relatively poor, making it uncomfortable
to use or difficult to understand. In order to achieve good quality
speech, prior art speech synthesis systems need specialized hardware
which is very expensive, and/or a large amount of memory space in the 
computer system generating the sound. 

In text-to-speech systems, an algorithm reviews an input text 
string, and translates the words in the text string into a sequence of 
diphones which must be translated into synthesized speech. Also, text- 
to-speech systems analyze the text based on word type and context to 
generate intonation control used for adjusting the duration of the sounds 
and the pitch of the sounds involved in the speech. 

Diphones consist of a unit of speech composed of the transition 
between one sound, or phoneme, and an adjacent sound, or phoneme. 
Diphones typically start at the center of one phoneme and end at the 
center of a neighboring phoneme. This preserves the transition between 
the sounds relatively well. 

American English based text-to-speech systems, depending on the 
particular implementation, use about fifty different sounds referred to as 
phones. Of these fifty different sounds, the standard language uses 
about 1800 diphones out of a possible 2500 phone pairs. Thus, a text-to-
speech system must be capable of reproducing 1800 diphones. To
store the speech data directly for each diphone would involve a huge 
amount of memory. Thus, compression techniques have evolved to limit 
the amount of memory required for storing the diphones. 

Prior art text-to-speech systems are described in part in United
States Patent No. 4,852,168, entitled COMPRESSION OF STORED
WAVE FORMS FOR ARTIFICIAL SPEECH, invented by Sprague; and
United States Patent No. 4,692,941, entitled REAL-TIME TEXT-TO-
SPEECH CONVERSION SYSTEM, invented by Jacks, et al. Further
background concerning speech synthesis may be found in United States
Patent No. 4,384,169, entitled METHOD AND APPARATUS FOR
SPEECH SYNTHESIZING, invented by Mozer, et al.

Two concatenated diphones will have an ending frame and a
beginning frame. The ending frame of the left diphone must be blended 
with the beginning frame of the right diphone without audible 
discontinuities or clicks being generated. Since the right boundary of 
the first diphone and the left boundary of the second diphone 
correspond to the same phoneme in most situations, they are expected 
to be similar looking at the point of concatenation. However, because 
the two diphone codings are extracted from different contexts, they will 
not look identical. Thus, blending techniques of the prior art have 
attempted to blend concatenated waveforms at the end and beginning 
of left and right frames, respectively. Because the end and beginning of 
frames may not match well, blending noise results. Continuity of sound 
between adjacent diphones is thus distorted. 

Notwithstanding the prior work in this area, the use of text-to-
speech systems has not gained widespread acceptance. It is desirable
therefore to provide a software only text-to-speech system which is
portable to a wide variety of microcomputer platforms, produces high 
quality speech and operates in real time on such platforms. 

SUMMARY OF THE INVENTION
The present invention provides an apparatus for concatenating a 
first digital frame with a second digital frame of quasi-periodic 
waveforms, such as the ending and beginning of adjacent diphone 
strings being concatenated to form speech. The system is based on 
determining an optimum blend point for the first and second digital 
frames in response to the magnitudes of samples in the first and second 
digital frames. The frames are then blended to generate a digital 
sequence representing a concatenation of the first and second frames, 
with reference to the optimum blend point. This has the effect of 
providing much better continuity in the blending or concatenation of
diphones in text-to-speech systems than has been available in the prior
art. 

Further, the technique is applicable to concatenating any two 
quasi-periodic waveforms, commonly encountered in sound synthesis or 
speech, music, sound effects, or the like.

According to one aspect of the present invention, the system 
operates by first computing an extended frame in response to the first 
digital frame, and then finding a subset of the extended frame which 
matches the second digital frame relatively well. The optimum blend 
point is then defined as a sample in the subset of the extended frame.
The subset of the extended frame which matches the second digital
frame relatively well is determined using a minimum average magnitude
difference function over the samples in the subset. The blend point in
this aspect comprises the first sample of the subset. To generate the
concatenated waveform, the subset of the extended frame is combined
with the second digital frame and concatenated with the beginning
segment of the extended frame to produce the concatenated waveform.

The concatenated sequence is then converted to analog form, or
other physical representation of the waveforms being blended.

According to another aspect, the present invention provides an
apparatus for synthesizing speech in response to text. The system
includes a translator, by which text is translated to a sequence of sound 
segment codes which identify diphones. Next, a decoder is applied to 
the sequence of sound segment codes to produce strings of digital 
frames which represent diphones for respective sound segment codes
in the sequence. A concatenator is provided by which a first digital 
frame at the ending of an identified string of digital frames for a 
particular sound segment code in the sequence is concatenated with a 
second digital frame at the beginning of an identified string of digital 
frames of an adjacent sound segment code in the sequence to produce
a speech data sequence. The concatenating system includes a buffer
to store samples of the first and second digital frames. Software, or
other processing resources, determine a blend point for the first and 
second digital frames and blend the first and second frames in response 
to the blend point to produce a concatenation of the first and second 
sound segments. An audio transducer is coupled to the concatenating 
system to generate synthesized speech in response to the speech data 
sequence. 

In one embodiment of the invention, the resources that determine 
the optimum blend point include computing resources that compute an 
extended frame comprising a discontinuity smoothed concatenation of 
the first digital frame with a replica of the first digital frame. Further 
resources find a subset of the extended frame with a minimum average 
magnitude difference between the samples in the subset and in the 
second digital frame and define the optimum blend point as the first 
sample in the subset. The blending resources include software or other 
computing resources that supply a first set of samples derived from the 
first digital frame and the blend point as a first segment of the digital 
sequence. Next, the second digital frame is combined with the subset 
of the extended frame, with emphasis on the subset of the extended 
frame in a starting sample and emphasis on the second digital frame in 
an ending sample to produce a second segment of the digital sequence. 
The first segment and second segment are combined to produce the
speech data sequence. 
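
The blend point search and blending described in this embodiment can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `find_blend_point` and `blend` are hypothetical, the caller is assumed to have already built the extended frame, and a linear cross-fade stands in for the weighting, which the text above describes only as emphasis shifting from the extended-frame subset to the second frame.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the offset p in ext[0..ext_len) whose
   n-sample window has the smallest summed magnitude difference from
   right[0..n). Dividing every sum by n would give the average and
   leave the ordering unchanged. Requires ext_len >= n. */
static int find_blend_point(const short *ext, int ext_len,
                            const short *right, int n)
{
    int best = 0;
    long best_sum = -1;
    for (int p = 0; p + n <= ext_len; p++) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += labs((long)ext[p + i] - (long)right[i]);
        if (best_sum < 0 || sum < best_sum) {
            best_sum = sum;
            best = p;
        }
    }
    return best;
}

/* Assumed linear cross-fade: full weight on the extended-frame subset
   at the first sample, shifting toward the right frame at the end. */
static void blend(const short *subset, const short *right, int n,
                  short *out)
{
    for (int i = 0; i < n; i++) {
        long w_left = n - i;
        long w_right = i;
        out[i] = (short)((w_left * subset[i] + w_right * right[i]) / n);
    }
}
```

The samples of the extended frame before the returned offset would form the first segment, and the output of `blend` the second segment.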

According to yet further aspects of the present invention, the 
text-to-speech apparatus includes a processing module for adjusting the 
pitch and duration of the identified strings of digital frames in the speech 
data sequence in response to the input text. Also, the decoder is based 
on a vector quantization technique which provides excellent quality 
compression with very small decoding resources required. 
Other aspects and advantages of the present invention can be
seen upon review of the figures, the detailed description, and the claims 
which follow. 

BRIEF DESCRIPTION OF THE FIGURES

Fig. 1 is a block diagram of a generic hardware platform 
incorporating the text-to-speech system of the present invention. 

Fig. 2 is a flow chart illustrating the basic text-to-speech routine 
according to the present invention. 

Fig. 3 illustrates the format of diphone records according to one 
embodiment of the present invention. 

Fig. 4 is a flow chart illustrating the encoder for speech data 
according to the present invention. 

Fig. 5 is a graph discussed in reference to the estimation of pitch 
filter parameters in the encoder of Fig. 4. 

Fig. 6 is a flow chart illustrating the full search used in the 
encoder of Fig. 4. 

Fig. 7 is a flow chart illustrating a decoder for speech data 
according to the present invention. 

Fig. 8 is a flow chart illustrating a technique for blending the 
beginning and ending of adjacent diphone records. 

Fig. 9 consists of a set of graphs referred to in explanation of the 
blending technique of Fig. 8. 

Fig. 10 is a graph illustrating a typical pitch versus time diagram 
for a sequence of frames of speech data. 

Fig. 11 is a flow chart illustrating a technique for increasing the
pitch period of a particular frame.

Fig. 12 is a set of graphs referred to in explanation of the
technique of Fig. 11.

Fig. 13 is a flow chart illustrating a technique for decreasing the 
pitch period of a particular frame. 
Fig. 14 is a set of graphs referred to in explanation of the
technique of Fig. 13.

Fig. 15 is a flow chart illustrating a technique for inserting a pitch
period between two frames in a sequence.

Fig. 16 is a set of graphs referred to in explanation of the
technique of Fig. 15.

Fig. 17 is a flow chart illustrating a technique for deleting a pitch 
period in a sequence of frames. 

Fig. 18 is a set of graphs referred to in explanation of the 
technique of Fig. 17.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
A detailed description of preferred embodiments of the present 
invention is provided with reference to the figures. Figs. 1 and 2 
provide an overview of a system incorporating the present invention. Fig.
3 illustrates the basic manner in which diphone records are stored
according to the present invention. Figs. 4-6 illustrate the encoding
methods based on vector quantization of the present invention. Fig. 7
illustrates the decoding algorithm according to the present invention.
Figs. 8 and 9 illustrate a preferred technique for blending the
beginning and ending of adjacent diphone records. Figs. 10-18 illustrate
the techniques for controlling the pitch and duration of sounds in the 
text-to-speech system. 

I. System Overview (Figs. 1-3) 

Fig. 1 illustrates a basic microcomputer platform incorporating a 
text-to-speech system based on vector quantization according to the
present invention. The platform includes a central processing unit 10
coupled to a host system bus 11. A keyboard 12 or other text input
device is provided in the system. Also, a display system 13 is coupled
to the host system bus. The host system also includes a non-volatile
storage system such as a disk drive 14. Further, the system includes
host memory 15. The host memory includes text-to-speech (TTS) code,
including encoded voice tables, buffers, and other host memory. The
text-to-speech code is used to generate speech data for supply to an
audio output module 16 which includes a speaker 17. The code also
includes an optimum blend point, diphone concatenation routine as
described in detail with reference to Figs. 8 and 9.

According to the present invention, the encoded voice tables 
include a TTS dictionary which is used to translate text to a string of 
diphones. Also included is a diphone table which translates the 
diphones to identified strings of quantization vectors. A quantization 
vector table is used for decoding the sound segment codes of the 
diphone table into the speech data for audio output. Also, the system 
may include a vector quantization table for encoding which is loaded into 
the host memory 15 when necessary.

The platform illustrated in Fig. 1 represents any generic 
microcomputer system, including a Macintosh based system, a DOS
based system, a UNIX based system or other types of microcomputers.
The text-to-speech code and encoded voice tables according to the
present invention for decoding occupy a relatively small amount of host
memory 15. For instance, a text-to-speech decoding system according
to the present invention may be implemented which occupies less than 
640 kilobytes of main memory, and yet produces high quality, natural 
sounding synthesized speech. 

The basic algorithm executed by the text-to-speech code is 
illustrated in Fig. 2. The system first receives the input text (block 20). 
The input text is translated to diphone strings using the TTS dictionary 
(block 21). At the same time, the input text is analyzed to generate 
intonation control data, to control the pitch and duration of the diphones 
making up the speech (block 22). 
After the text has been translated to diphone strings, the diphone
strings are decompressed to generate vector quantized data frames
(block 23). After the vector quantized (VQ) data frames are produced,
the beginnings and endings of adjacent diphones are blended to smooth
any discontinuities (block 24). Next, the duration and pitch of the
diphone VQ data frames are adjusted in response to the intonation
control data (blocks 25 and 26). Finally, the speech data is supplied to
the audio output system for real time speech production (block 27). For
systems having sufficient processing power, an adaptive post filter may
be applied to further improve the speech quality.

The TTS dictionary can be implemented using any one of a variety 
of techniques known in the art. According to the present invention, 
diphone records are implemented as shown in Fig. 3 in a highly 
compressed format. 

Fig. 3 shows a record for a left diphone 30 and a record for a right
diphone 31. The record for the left diphone 30 includes a count 32 of
the number NL of pitch periods in the diphone. Next, a pointer 33 is
included which points to a table of length NL storing the number
$LP_i$, for each pitch period i from 0 to NL-1, of pitch values for
corresponding compressed frame records. Finally, pointer 34 is included
to a table 36 of ML vector quantized compressed speech records, each
having a fixed set length of encoded frame size related to nominal pitch
of the encoded speech for the left diphone. The nominal pitch is based
upon the average number of samples for a given pitch period for the
speech data base.

A similar structure can be seen for the right diphone 31. Using
vector quantization, the length of the compressed speech records is
very short relative to the quality of the speech generated.

The format of the vector quantized speech records can be
understood further with reference to the frame encoder routine and the
frame decoder routine described below with reference to Figs. 4-7.

II. The Encoder/Decoder Routines (Figs. 4-7)

The encoder routine is illustrated in Fig. 4. The encoder accepts
as input a frame $s_n$ of speech data. In the preferred system, the
speech samples are represented as 12 or 16 bit two's complement
numbers, sampled at 22,252 Hz. This data is divided into
non-overlapping frames $s_n$ having a length of N, where N is referred
to as the frame size. The value of N depends on the nominal pitch of
the speech data. If the nominal pitch period of the recorded speech is
less than 165 samples (a pitch above about 135 Hz), the value of N is
chosen to be 96. Otherwise a frame size of 160 is used. The encoder
transforms the N-point data sequence $s_n$ into a byte stream of
shorter length, which depends on the desired compression rate. For
example, if N = 160 and very high data compression is desired, the
output byte stream can be as short as 12 eight bit bytes. A block
diagram of the encoder is shown in Fig. 4.

Thus, the routine begins by accepting a frame $s_n$ (block 50). To
remove low frequency noise, such as DC or 60 Hz power line noise, and
produce offset free speech data, signal $s_n$ is passed through a high
pass filter. A difference equation used in a preferred system to
accomplish this is set out in Equation 1 for $0 \le n < N$.

$$x_n = s_n - s_{n-1} + 0.999\, x_{n-1}$$

Equation 1

The value $x_n$ is the "offset free" signal. The variables $s_{-1}$ and
$x_{-1}$ are initialized to zero for each diphone and are subsequently
updated using the relation of Equation 2.

$$x_{-1} = x_N \quad \text{and} \quad s_{-1} = s_N$$

Equation 2

This step can be referred to as offset compensation or DC 
removal (block 51). 
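
The offset compensation step can be sketched in C as below. This is an illustrative sketch only: `struct dc_state` and `dc_remove` are hypothetical names, double precision arithmetic is an assumption, and the carried state is taken to be the last input and output sample of the previous frame, which is one reading of Equation 2.

```c
/* Filter memory carried from frame to frame within one diphone.
   Both fields start at zero for each new diphone. */
struct dc_state {
    double s_prev;  /* previous input sample,  s_{n-1} */
    double x_prev;  /* previous output sample, x_{n-1} */
};

/* Equation 1: x_n = s_n - s_{n-1} + 0.999 * x_{n-1}, 0 <= n < N. */
static void dc_remove(struct dc_state *st, const short *s,
                      double *x, int n_samples)
{
    for (int n = 0; n < n_samples; n++) {
        x[n] = s[n] - st->s_prev + 0.999 * st->x_prev;
        st->s_prev = s[n];
        st->x_prev = x[n];
    }
}
```

For a constant input the output decays toward zero by a factor of 0.999 per sample, which is the DC-blocking behaviour the text describes.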

In order to partially decorrelate the speech samples and the
quantization noise, the sequence $x_n$ is passed through a fixed first
order linear prediction filter. The difference equation to accomplish
this is set forth in Equation 3.

$$y_n = x_n - 0.875\, x_{n-1}$$

Equation 3

The linear prediction filtering of Equation 3 produces a frame $y_n$
(block 52). The filter parameter, which is equal to 0.875 in Equation 3,
will have to be modified if a different speech sampling rate is used.
The value of $x_{-1}$ is initialized to zero for each diphone, but will
be updated in the step of inverse linear prediction filtering (block 60)
as described below.

It is possible to use a variety of filter types, including, for 
instance, an adaptive filter in which the filter parameters are dependent 
on the diphones to be encoded, or higher order filters. 
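
A sketch of the fixed first order filter of Equation 3 follows. The function name `lp_filter` is hypothetical, and returning the last input sample is a simplification made here for illustration; in the text, the value carried to the next frame is refreshed from the decoded signal in block 60.

```c
/* Equation 3: y_n = x_n - 0.875 * x_{n-1}, 0 <= n < N.
   x_prev is x_{-1} on entry. Returns the last input sample so that
   a caller can carry filter memory across frames if it wishes. */
static double lp_filter(double x_prev, const double *x, double *y,
                        int n_samples)
{
    for (int n = 0; n < n_samples; n++) {
        y[n] = x[n] - 0.875 * x_prev;
        x_prev = x[n];
    }
    return x_prev;
}
```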

The sequence $y_n$ produced by Equation 3 is then utilized to
determine an optimum pitch value, $P_{opt}$, and an associated gain
factor, $\beta$. $P_{opt}$ is computed using the functions $s_{xy}(P)$,
$s_{xx}(P)$, $s_{yy}(P)$, and the coherence function Coh(P) defined by
Equations 4, 5, 6 and 7 as set out below.

$$s_{xy}(P) = \sum_{n=0}^{N-1} y_n \cdot \mathrm{PBUF}_{P_{max} - P + n}$$

Equation 4

$$s_{xx}(P) = \sum_{n=0}^{N-1} y_n \cdot y_n$$

Equation 5

$$s_{yy}(P) = \sum_{n=0}^{N-1} \mathrm{PBUF}_{P_{max} - P + n} \cdot \mathrm{PBUF}_{P_{max} - P + n}$$

Equation 6

and

$$\mathrm{Coh}(P) = \frac{s_{xy}(P) \cdot s_{xy}(P)}{s_{xx}(P) \cdot s_{yy}(P)}$$

Equation 7

PBUF is a pitch buffer of size $P_{max}$, which is initialized to zero,
and updated in the pitch buffer update block 59 as described below.
$P_{opt}$ is the value of P for which Coh(P) is maximum and $s_{xy}(P)$
is positive. The range of P considered depends on the nominal pitch of
the speech being coded. The range is (96 to 350) if the frame size is
equal to 96 and is (160 to 414) if the frame size is equal to 160.
$P_{max}$ is 350 if nominal pitch is less than 160 and is equal to 414
otherwise. The parameter $P_{opt}$ can be represented using 8 bits.

The computation of $P_{opt}$ can be understood with reference to Fig.
5. In Fig. 5, the buffer PBUF is represented by the sequence 100 and
the frame $y_n$ is represented by the sequence 101. In a segment of
speech data in which the preceding frames are substantially equal to
the frame $y_n$, PBUF and $y_n$ will look as shown in Fig. 5. $P_{opt}$
will have the value at point 102, where the vector $y_n$ 101 matches as
closely as possible a corresponding segment of similar length in PBUF
100.
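
The search for the optimum pitch value can be sketched as follows. This is a hedged illustration of Equations 4 through 7, not the routine of the preferred system: `find_pitch` is a hypothetical name, double precision is assumed, and the caller is assumed to pass a lag range whose lower bound is at least the frame size so that every read stays inside the buffer.

```c
/* Returns the lag P in [p_lo, p_hi] for which Coh(P) is largest with
   s_xy(P) positive, or -1 if no lag qualifies. pbuf holds p_max
   samples; the candidate segment for lag P starts at pbuf[p_max - P].
   Assumes n_samples <= p_lo and p_hi <= p_max. */
static int find_pitch(const double *y, int n_samples,
                      const double *pbuf, int p_max,
                      int p_lo, int p_hi)
{
    double sxx = 0.0;                       /* Equation 5 */
    for (int n = 0; n < n_samples; n++)
        sxx += y[n] * y[n];

    int best = -1;
    double best_coh = -1.0;
    for (int p = p_lo; p <= p_hi; p++) {
        const double *seg = pbuf + (p_max - p);
        double sxy = 0.0, syy = 0.0;        /* Equations 4 and 6 */
        for (int n = 0; n < n_samples; n++) {
            sxy += y[n] * seg[n];
            syy += seg[n] * seg[n];
        }
        if (sxy <= 0.0 || sxx == 0.0 || syy == 0.0)
            continue;                       /* need s_xy(P) > 0 */
        double coh = (sxy * sxy) / (sxx * syy);   /* Equation 7 */
        if (coh > best_coh) {
            best_coh = coh;
            best = p;
        }
    }
    return best;
}
```

When the frame exactly repeats a segment of the buffer, the coherence at that lag is 1, its largest possible value.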

The pitch filter gain parameter $\beta$ is determined using the
expression of Equation 8.

Equation 8

$\beta$ is quantized to four bits, so that the quantized value of
$\beta$ can range from 1/16 to 1, in steps of 1/16.

Next, a pitch filter is applied (block 54). The long term
correlations in the pre-emphasized speech data $y_n$ are removed using
the relation of Equation 9.

$$r_n = y_n - \beta \cdot \mathrm{PBUF}_{P_{max} - P_{opt} + n}, \quad 0 \le n < N$$

Equation 9

This results in computation of a residual signal $r_n$.
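
The pitch filter of Equation 9 can be sketched as below. The name `pitch_filter` is hypothetical, and the mapping from a 4-bit code to the gain, (code + 1) / 16, is an assumption chosen only because it spans 1/16 to 1 in steps of 1/16; the text does not state how the code is laid out.

```c
/* Equation 9: r_n = y_n - beta * PBUF[P_max - P_opt + n], 0 <= n < N.
   beta_code is a 4-bit value 0..15; the mapping to beta is assumed. */
static void pitch_filter(const double *y, int n_samples,
                         const double *pbuf, int p_max, int p_opt,
                         unsigned beta_code, double *r)
{
    double beta = (double)(beta_code + 1) / 16.0;
    const double *seg = pbuf + (p_max - p_opt);
    for (int n = 0; n < n_samples; n++)
        r[n] = y[n] - beta * seg[n];
}
```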
Next, a scaling parameter G is generated using a block gain
estimation routine (block 55). In order to increase the computational
accuracy of the following stages of processing, the residual signal
$r_n$ is rescaled. The scaling parameter, G, is obtained by first
determining the largest magnitude of the signal $r_n$ and quantizing it
using a 7-level quantizer. The parameter G can take one of the
following 7 values: 256, 512, 1024, 2048, 4096, 8192, and 16384.
Because these quantization levels are powers of two, the rescaling
operation can be implemented using only shift operations.

Next, the routine proceeds to residual coding using a full search
vector quantization code (block 56). In order to code the residual
signal $r_n$, the N-point sequence $r_n$ is divided into non-overlapping
blocks of length M, where M is referred to as the "vector size". Thus,
M-sample blocks $b_{ij}$ are created, where i is an index from zero to
N/M-1 on the block number, and j is an index from zero to M-1 on the
sample within the block. Each block may be defined as set out in
Equation 10.

$$b_{ij} = r_{Mi+j}, \quad 0 \le i < N/M \ \text{and} \ 0 \le j < M$$

Equation 10

Each of these M-sample blocks $b_{ij}$ will be coded into an 8 bit
number using vector quantization. The value of M depends on the
desired compression ratio. For example, with M equal to 16, very high
compression is achieved (i.e., 16 residual samples are coded using only
8 bits). However, the decoded speech quality can be perceived to be
somewhat noisy with M = 16. On the other hand, with M = 2, the
decompressed speech quality will be very close to that of uncompressed
speech. However, the length of the compressed speech records will be
longer. In the preferred implementation, M can take the values 2,
4, 8, and 16.

The vector quantization is performed as shown in Fig. 6. Thus,
for all blocks $b_{ij}$ a sequence of quantization vectors is identified
(block 120). First, the components of block $b_{ij}$ are passed through
a noise shaping filter and scaled as set out in Equation 11 (block 121).

$$w_j = 0.875\, w_{j-1} - 0.5\, w_{j-2} + 0.4375\, w_{j-3} + b_{ij}, \quad 0 \le j < M$$

$$v_{ij} = G \cdot w_j, \quad 0 \le j < M$$

Equation 11

Thus, $v_{ij}$ is the jth component of the vector $v_i$, and the values
$w_{-1}$, $w_{-2}$ and $w_{-3}$ are the states of the noise shaping
filter and are initialized to zero for each diphone. The filter
coefficients are chosen to shape the quantization noise spectra in
order to improve the subjective quality of the decompressed speech.
After each vector is coded and decoded, these states are updated as
described below with reference to blocks 124-126.
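
The noise shaping and scaling of Equation 11 can be sketched for one block as follows. The name `noise_shape` and the three-element state array are hypothetical; the sketch reads the filter states without modifying them, since the text updates the states separately after each vector has been coded and decoded.

```c
/* Equation 11 for one M-sample block b:
     w_j = 0.875*w_{j-1} - 0.5*w_{j-2} + 0.4375*w_{j-3} + b_j
     v_j = G * w_j
   state[0..2] hold w_{-1}, w_{-2}, w_{-3} on entry and are left
   untouched here. */
static void noise_shape(const double state[3], const double *b,
                        int m, double g, double *v)
{
    double w1 = state[0], w2 = state[1], w3 = state[2];
    for (int j = 0; j < m; j++) {
        double w = 0.875 * w1 - 0.5 * w2 + 0.4375 * w3 + b[j];
        v[j] = g * w;
        w3 = w2;   /* shift the filter memory by one sample */
        w2 = w1;
        w1 = w;
    }
}
```

Because the coefficients 0.875, 0.5 and 0.4375 are sums of negative powers of two, a fixed point version could replace the multiplications with shifts and adds.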

Next, the routine finds a pointer to the best match in a vector
quantization table (block 122). The vector quantization table 123
consists of a sequence of vectors $C_0$ through $C_{255}$.

Thus, the vector $v_i$ is compared against 256 M-point vectors,
which are precomputed and stored in the code table 123. The vector
$C_{q_i}$ which is closest to $v_i$ is determined according to Equation
12. The value $C_p$ for p = 0 through 255 represents the p-th encoding
vector from the vector quantization code table 123.

$$\min_p \sum_{j=0}^{M-1} \left( v_{ij} - C_p(j) \right)^2$$

Equation 12

The closest vector $C_{q_i}$ can also be determined efficiently using
the technique of Equation 13.

$$v_i^T \cdot C_{q_i} \ge v_i^T \cdot C_p \quad \text{for all } p \ (0 \le p \le 255)$$

Equation 13
In Equation 13, the value $v^T$ represents the transpose of the vector
v, and $\cdot$ represents the inner product operation in the inequality.

The encoding vectors $C_p$ in table 123 are utilized to match on the
noise filtered value $v_{ij}$. However, in decoding, a decoding vector
table 125 is used which consists of a sequence of vectors $QV_p$. The
values $QV_p$ are selected for the purpose of achieving quality sound
data using the vector quantization technique. Thus, after finding the
vector $C_{q_i}$, the pointer $q_i$ is utilized to access the vector
$QV_{q_i}$. The decoded samples corresponding to the vector $b_i$ which
is produced at step 55 of Fig. 4 are given by the M-point vector
$(1/G) \cdot QV_{q_i}$. The vector $C_p$ is related to the vector $QV_p$
by the noise shaping filter operation of Equation 11. Thus, when the
decoding vector $QV_p$ is accessed, no inverse noise shaping filter
needs to be computed in the decode operation. The table 125 of Fig. 6
thus includes noise compensated quantization vectors.
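
The full search of Equation 12 and the table lookup used for decoding can be sketched as follows. The names `vq_search` and `vq_decode` are hypothetical, and storing each table as a flat row-major array of doubles is an assumption made for illustration; the text does not describe the memory layout of tables 123 and 125.

```c
/* Equation 12: index of the encoding vector closest to v in squared
   Euclidean distance. enc_table holds n_vectors rows of m values. */
static int vq_search(const double *v, const double *enc_table,
                     int n_vectors, int m)
{
    int best = 0;
    double best_d = -1.0;
    for (int p = 0; p < n_vectors; p++) {
        double d = 0.0;
        for (int j = 0; j < m; j++) {
            double e = v[j] - enc_table[p * m + j];
            d += e * e;
        }
        if (best_d < 0.0 || d < best_d) {
            best_d = d;
            best = p;
        }
    }
    return best;
}

/* Decoded samples for index q: (1/G) * QV_q, read from the separate
   decoding table, so no inverse noise shaping is computed. */
static void vq_decode(int q, const double *dec_table, int m,
                      double g, double *out)
{
    for (int j = 0; j < m; j++)
        out[j] = dec_table[q * m + j] / g;
}
```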

In continuing to compute the encoding vectors for the vectors
$b_{ij}$ which make up the residual signal $r_n$, the decoding vector of
the pointer to the vector $b_i$ is accessed (block 124). That decoding
vector is used for filter and PBUF updates (block 126).

For the noise shaping filter, after the decoded samples are
computed for each sub-block $b_i$, the error vector $(b_i - QV_{q_i})$
is passed through the noise shaping filter as shown in Equation 14.

$$w_j = 0.875\, w_{j-1} - 0.5\, w_{j-2} + 0.4375\, w_{j-3} + \left[ b_{ij} - QV_{q_i}(j) \right], \quad 0 \le j < M$$

Equation 14

In Equation 14, the value $QV_{q_i}(j)$ represents the jth component of
the decoding vector $QV_{q_i}$. The noise shaping filter states for the
next block are updated as shown in Equation 15.

$$w_{-1} = w_{M-1}, \quad w_{-2} = w_{M-2}, \quad w_{-3} = w_{M-3}$$

Equation 15

This coding and decoding is performed for all of the N/M sub-blocks
to obtain N/M indices to the decoding vector table 125. This string of
indices $Q_n$, for n going from zero to N/M-1, represents identifiers
for a string of decoding vectors for the residual signal $r_n$.

Thus, four parameters represent the N-point data sequence $y_n$:

1) Optimum pitch, $P_{opt}$ (8 bits),

2) Pitch filter gain, $\beta$ (4 bits),

3) Scaling parameter, G (3 bits), and

4) A string of decoding table indices, $Q_n$ ($0 \le n < N/M$).

The parameters $\beta$ and G can be coded into a single byte. Thus,
only (N/M) plus 2 bytes are used to represent N samples of speech. For
example, suppose nominal pitch is 100 samples long, and M = 16. In
this case, a frame of 96 samples of speech is represented by 8 bytes:
1 byte for $P_{opt}$, 1 byte for $\beta$ and G, and 6 bytes for the
decoding table indices $Q_n$. If the uncompressed speech consists of 16
bit samples, then this represents a compression of 24:1.
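
The byte count and compression figure above can be checked with a few lines of C. The helper names `frame_bytes` and `compression_ratio` are hypothetical and exist only to restate the arithmetic of the preceding paragraph.

```c
/* Bytes per encoded frame: N/M codebook indices, one byte for the
   pitch, and one byte shared by the pitch filter gain and G. */
static int frame_bytes(int n, int m)
{
    return n / m + 2;
}

/* Ratio of uncompressed to compressed size for one frame. */
static int compression_ratio(int n, int m, int bytes_per_sample)
{
    return (n * bytes_per_sample) / frame_bytes(n, m);
}
```

With N = 96, M = 16 and 2-byte samples this gives 192 / 8 = 24, matching the 24:1 figure, and with N = 160, M = 16 the frame occupies 12 bytes, matching the earlier example.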

Back to Fig. 4, four parameters identifying the speech data are
stored (block 57). In a preferred system, they are stored in a structure
as described with respect to Fig. 3 where the structure of the frame can
be characterized as follows:

#define NumOfVectorsPerFrame (FrameSize / VectorSize)

struct frame {
    unsigned Gain : 4;
    unsigned Beta : 3;
    unsigned UnusedBit : 1;
    unsigned char Pitch;
    unsigned char VQcodes[NumOfVectorsPerFrame];
};
The diphone record of Fig. 3 utilizing this frame structure can be
characterized as follows:

struct DiphoneRecord {
    char LeftPhone, RightPhone;
    short LeftPitchPeriodCount, RightPitchPeriodCount;
    short *LeftPeriods, *RightPeriods;
    struct frame *LeftData, *RightData;
};

These stored parameters uniquely provide for identification of the
diphones required for text-to-speech synthesis.

As mentioned above with respect to Fig. 6, the encoder continues
decoding the data being encoded in order to update the filter and PBUF
values. The first step involved in this is an inverse pitch filter
(block 58). With the vector $r'_n$ corresponding to the decoded signal
formed by concatenating the string of decoding vectors to represent the
residual signal $r_n$, the inverse filter is implemented as set out in
Equation 16.

$$y'_n = r'_n + \beta \cdot \mathrm{PBUF}_{P_{max} - P_{opt} + n}, \quad 0 \le n < N$$

Equation 16

Next, the pitch buffer is updated (block 59) with the output of the
inverse pitch filter. The pitch buffer PBUF is updated as set out in
Equation 17.

$$\mathrm{PBUF}_n = \mathrm{PBUF}_{n+N}, \quad 0 \le n < (P_{max} - N)$$

$$\mathrm{PBUF}_{P_{max} - N + n} = y'_n, \quad 0 \le n < N$$

Equation 17

Finally, the linear prediction filter parameters are updated using an
inverse linear prediction filter step (block 60). The output of the
inverse pitch filter is passed through a first order inverse linear
prediction filter to obtain the decoded speech. The difference equation
to implement this filter is set out in Equation 18.

$$x'_n = 0.875\, x'_{n-1} + y'_n$$

Equation 18

In Equation 18, $x'_n$ is the decompressed speech. From this, the
value of $x_{-1}$ for the next frame is set to the value $x_N$ for use
in the step of block 52.

Fig. 7 illustrates the decoder routine. The decoder module
accepts as input (N/M) + 2 bytes of data, generated by the encoder
module, and supplies as output N samples of speech. The value of N
depends on the nominal pitch of the speech data and the value of M
depends on the desired compression ratio.

In software only text-to-speech systems, the computational
complexity of the decoder must be as small as possible to ensure that
the text-to-speech system can run in real time even on slow computers.
A block diagram of the decoder is shown in Fig. 7.

The routine starts by accepting diphone records at block 200.
The first step involves parsing the parameters G, β, P and the vector
quantization string Q R (block 201). Next, the residual signal r'_n is
decoded (block 202). This involves accessing and concatenating the
decoding vectors for the vector quantization string as shown
schematically at block 203 with access to the decoding vector table
125.
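Decoding the residual amounts to a table lookup per code followed by concatenation. The sketch below illustrates this in C under stated assumptions: decode_residual is a hypothetical helper, the decoding vector table is modelled as an array of pointers to vectors of vsize samples, and the gain scaling (multiply by 4, shift right by 7 - gcode) mirrors the rescaling step visible in the encoder listing of the Appendix rather than a confirmed decoder routine.

```c
#include <assert.h>

/* Rebuilds the residual r' from a vector quantization string.
 * Each byte of vqcode indexes one decoding vector of vsize samples. */
void decode_residual(const unsigned char *vqcode, int n_vectors, int vsize,
                     short **decode_book, int gcode, short *r)
{
    int v, k;
    int shift = 7 - gcode;  /* mirrors lshift_count in the Appendix encoder */

    assert(vsize > 0 && shift >= 0);
    for (v = 0; v < n_vectors; v++) {
        const short *vec = decode_book[vqcode[v]];

        /* Scale the 14-bit codebook entries up and concatenate them. */
        for (k = 0; k < vsize; k++)
            r[v * vsize + k] = (short)((4 * vec[k]) >> shift);
    }
}
```

The output buffer r must hold n_vectors * vsize samples, which corresponds to one frame of N samples when N is a multiple of the vector size.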

After the residual signal r'_n is decoded, an inverse pitch filter is
applied (block 204). This inverse pitch filter is implemented as shown
in Equation 19:

$$ y'_n = r'_n + \beta \cdot \text{SPBUF}_{P_{max} - P_{opt} + n}, \quad 0 \le n < N $$

Equation 19

SPBUF is a synthesizer pitch buffer of length P_max initialized as zero for
each diphone, as described above with respect to the encoder pitch
buffer PBUF.

For each frame, the synthesis pitch buffer is updated (block 205).
The manner in which it is updated is shown in Equation 20:

$$ \text{SPBUF}_n = \text{SPBUF}_{n + N}, \quad 0 \le n < (P_{max} - N) $$

$$ \text{SPBUF}_{P_{max} - N + n} = y'_n, \quad 0 \le n < N $$

Equation 20

After updating SPBUF, the sequence y'_n is applied to an inverse
linear prediction filtering step (block 206). Thus, the output of the
inverse pitch filter y'_n is passed through a first order inverse linear
prediction filter to obtain the decoded speech. The difference equation
to implement the inverse linear prediction filter is set out in Equation 21:

$$ x'_n = 0.875 \cdot x'_{n-1} + y'_n $$

Equation 21

In Equation 21, the vector x'_n corresponds to the decompressed
speech. This filtering operation can be implemented using simple shift
operations without requiring any multiplication. Therefore, it executes
very quickly and utilizes a very small amount of the host computer
resources.
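The shift-only form of Equation 21 follows from 0.875 = 1 - 1/8, so the product can be formed as x - (x >> 3). The C sketch below illustrates the idea; the function name and the use of a long accumulator carried between frames are assumptions, no saturation is applied, and right-shifting a negative value is assumed to be an arithmetic shift.

```c
#include <assert.h>

/* Equation 21: x'[n] = 0.875 * x'[n-1] + y'[n], without multiplication.
 * *state carries x'[n-1] across frames (0 at the start of a diphone). */
void inverse_lp_filter(const short *y, short *x, int n_samples, long *state)
{
    int n;
    long prev = *state;

    assert(n_samples >= 0);
    for (n = 0; n < n_samples; n++) {
        prev = prev - (prev >> 3) + y[n];   /* 7/8 * prev + y[n] */
        x[n] = (short)prev;
    }
    *state = prev;
}
```

Because the shift truncates, the result can differ from the exact product by less than one unit per sample, which is the usual trade-off of this kind of fixed-point filter.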

Encoding and decoding speech according to the algorithms
described above provides several advantages over prior art systems.
First, this technique offers higher speech compression rates with
decoders simple enough to be used in the implementation of software
only text-to-speech systems on computer systems with low processing
power. Second, the technique offers a very flexible trade-off between
the compression ratio and synthesized speech quality. A high-end
computer system can opt for higher quality synthesized speech at the
expense of a bigger RAM memory requirement.



III. Waveform Blending For Discontinuity Smoothing (Figs. 8 and 9)

As mentioned above with respect to Fig. 2, the synthesized
frames of speech data generated using the vector quantization technique
may result in slight discontinuities between diphones in a text string.
Thus, the text-to-speech system provides a module for blending the
diphone data frames to smooth such discontinuities. The blending
technique of the preferred embodiment is shown with respect to Figs.
8 and 9.
Two concatenated diphones will have an ending frame and a
beginning frame. The ending frame of the left diphone must be blended
with the beginning frame of the right diphone without audible
discontinuities or clicks being generated. Since the right boundary of
the first diphone and the left boundary of the second diphone
correspond to the same phoneme in most situations, they are expected
to be similar looking at the point of concatenation. However, because
the two diphone codings are extracted from different contexts, they will
not look identical. This blending technique is applied to eliminate
discontinuities at the point of concatenation. In Fig. 9, the last frame,
referring here to one pitch period, of the left diphone is designated L_n
(0<=n<PL) at the top of the page. The first frame (pitch period) of the
right diphone is designated R_n (0<=n<PR). The blending of L_n and R_n
according to the present invention will alter these two pitch periods only
and is performed as discussed with reference to Fig. 8. The waveforms
in Fig. 9 are chosen to illustrate the algorithm, and may not be
representative of real speech data.

Thus, the algorithm as shown in Fig. 8 begins with receiving the
left and right diphone in a sequence (block 300). Next, the last frame
of the left diphone is stored in the buffer L_n (block 301). Also, the first
frame of the right diphone is stored in buffer R_n (block 302).

Next, the algorithm replicates and concatenates the left frame L_n
to form an extended frame (block 303). In the next step, the discontinuities
in the extended frame between the replicated left frames are smoothed
(block 304). This smoothed and extended left frame is referred to as El_n
in Fig. 9.
The extended sequence El_n (0<=n<2PL) is obtained in the first step
as shown in Equation 22:

$$ El_n = L_n, \quad n = 0, 1, \ldots, PL-1 $$

$$ El_{PL+n} = L_n, \quad n = 0, 1, \ldots, PL-1 $$

Equation 22

Then discontinuity smoothing from the point n = PL is conducted
according to the filter of Equation 23:

$$ El_{PL+n} = El_{PL+n} + \left[ El_{(PL-1)} - El'_{(PL-1)} \right] \cdot \Delta^{n+1}, \quad n = 0, 1, \ldots, (PL/2) $$

Equation 23

In Equation 23, the value Δ is equal to 15/16 and El'_(PL-1) = El_2 + 3
* (El_0 - El_1). Thus, as indicated in Fig. 9, the extended sequence El_n is
substantially equal to L_n on the left hand side, has a smoothed region
beginning at the point PL and converges on the original shape of L_n
toward the point 2PL. If L_n was perfectly periodic, then
El_(PL-1) = El'_(PL-1).
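The extension and smoothing steps can be sketched in C as below. This is an illustration under assumptions, not the blending module of the Appendix: the function name is hypothetical, floating point is used for the decaying factor for clarity, and the predicted sample El'(PL-1) is taken to be a quadratic extrapolation backward from the first three samples of the frame, which is how the garbled expression in the text is read here.

```c
#include <assert.h>

/* Builds the extended frame El (2 * pl samples) from L (pl samples) and
 * smooths the seam at n = pl with a correction decaying as (15/16)^(n+1). */
void extend_and_smooth(const short *l, int pl, short *el)
{
    int n;
    double predicted, err, w;
    const double decay = 15.0 / 16.0;

    assert(pl >= 3);
    for (n = 0; n < pl; n++) {              /* Equation 22 */
        el[n] = l[n];
        el[pl + n] = l[n];
    }

    /* Assumed form of El'(PL-1): the value a smooth periodic frame
     * would have just before its first sample. */
    predicted = el[2] + 3.0 * (el[0] - el[1]);
    err = el[pl - 1] - predicted;

    w = decay;
    for (n = 0; n <= pl / 2; n++) {         /* Equation 23 */
        el[pl + n] = (short)(el[pl + n] + err * w);
        w *= decay;
    }
}
```

When the last sample of the frame already matches the prediction, err is zero and the second copy is left untouched, which corresponds to the perfectly periodic case mentioned above.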

In the next step, the optimum match of R_n with the vector El_n is
found. This match point is referred to as P_opt (block 305). This is
accomplished essentially as shown in Fig. 9 by comparing R_n with El_n
to find the section of El_n which most closely matches R_n. This optimum
blend point determination is performed using Equation 24, where W is
the minimum of PL and PR, and AMDF represents the average
magnitude difference function.

$$ \text{AMDF}(p) = \sum_{n=0}^{W-1} \left| El_{n+p} - R_n \right| $$

Equation 24

This function is computed for values of p in the range of 0 to PL-1.
The vertical bars in the operation denote the absolute value. W is
the window size for the AMDF computation. P_opt is chosen to be the
value at which AMDF(p) is minimum. This means that p = P_opt
corresponds to the point at which sequences El_(n+p) (0<=n<W) and
R_n (0<=n<W) are very close to each other.
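The search of Equation 24 is a direct double loop. The following C sketch illustrates it; the function name is hypothetical, el is assumed to hold the 2*PL samples of the extended frame, and ties are resolved in favour of the smallest p, which the text does not specify.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns P_opt: the p in [0, pl-1] that minimizes
 * AMDF(p) = sum over n in [0, W) of |El[n+p] - R[n]|, W = min(pl, pr). */
int find_blend_point(const short *el, int pl, const short *r, int pr)
{
    int w = (pl < pr) ? pl : pr;
    int p, n, best_p = 0;
    long best = -1;

    assert(pl > 0 && pr > 0);
    for (p = 0; p < pl; p++) {
        long sum = 0;

        for (n = 0; n < w; n++)
            sum += labs((long)el[n + p] - (long)r[n]);
        if (best < 0 || sum < best) {
            best = sum;
            best_p = p;
        }
    }
    return best_p;
}
```

The largest index read from el is (W - 1) + (PL - 1), which stays inside the extended frame because W does not exceed PL.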

After determining the optimum blend point P_opt, the waveforms
are blended (block 306). The blending utilizes a first weighting ramp WL
which is shown in Fig. 9 beginning at P_opt in the El_n trace. A second
ramp, WR, is shown in Fig. 9 at the R_n trace, which is lined up with P_opt.
Thus, in the beginning of the blending operation, the value of El_n is
emphasized. At the end of the blending operation, the value of R_n is
emphasized.

Before blending, the length PL of L_n is altered as needed to ensure
that when the modified L_n and R_n are concatenated, the waveforms are
as continuous as possible. Thus, the length P'L is set to P_opt if P_opt is
greater than PL/2. Otherwise, the length P'L is equal to W + P_opt and
the sequence L_n is equal to El_n for 0<=n<=(P'L-1).

The blending ramp beginning at P_opt is set out in Equation 25:

$$ R_n = El_{n+P_{opt}} + \left[ R_n - El_{n+P_{opt}} \right] \cdot (n+1)/W, \quad 0 \le n < W $$

$$ R_n = R_n, \quad W \le n < PR $$

Equation 25

Thus, the sequences L_n and R_n are windowed and added to get
the blended R_n. The beginning of L_n and the ending of R_n are preserved
to prevent any discontinuities with adjacent frames.
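The blending step of Equation 25, together with the choice of the new left length P'L, might be sketched in C as follows. The function name and the in-place update of R are assumptions made for illustration; integer division truncates toward zero, so results can differ from the exact ramp by less than one unit.

```c
#include <assert.h>

/* Applies Equation 25 to r in place and returns the new left length P'L.
 * el holds the 2 * pl samples of the extended, smoothed left frame. */
int blend_frames(const short *el, int pl, short *r, int pr, int p_opt)
{
    int w = (pl < pr) ? pl : pr;
    int n;

    assert(p_opt >= 0 && p_opt < pl);
    for (n = 0; n < w; n++) {
        long a = el[n + p_opt];

        /* Ramp from El (weight near 1 at n = 0) to R (weight 1 at n = W-1). */
        r[n] = (short)(a + ((long)r[n] - a) * (n + 1) / w);
    }
    /* Samples r[w .. pr-1] are left unchanged. */

    return (p_opt > pl / 2) ? p_opt : w + p_opt;
}
```

A caller would then keep el[0 .. P'L-1] as the modified left frame and follow it with the blended r.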

This blending technique is believed to minimize blending noise in 
synthesized speech produced by any concatenated speech synthesis. 



IV. Pitch and Duration Modification (Figs. 10-18) 

As mentioned above with respect to Fig. 2, a text analysis
program analyzes the text and determines the duration and pitch contour
of each phone that needs to be synthesized and generates intonation
control signals. A typical control for a phone will indicate that a given
phoneme, such as AE, should have a duration of 200 milliseconds and
a pitch that should rise linearly from 220Hz to 300Hz. This requirement is
graphically shown in Fig. 10. As shown in Fig. 10, T equals the desired
duration (e.g. 200 milliseconds) of the phoneme. The frequency f_b is
the desired beginning pitch in Hz. The frequency f_e is the desired
ending pitch in Hz. The labels P_1, P_2, ..., P_6 indicate the number of
samples of each frame to achieve the desired pitch frequencies f_1,
f_2, ..., f_6. The relationship between the desired number of samples, P_i,
and the desired pitch frequency f_i (f_1 = f_b) is defined by the relation:
P_i = F_s/f_i, where F_s is the sampling frequency for the data.

As can be seen in Fig. 10, the pitch period for a lower frequency period
of the phoneme is longer than the pitch period for a higher frequency
period of the phoneme. If the nominal pitch period were P_3, then the
algorithm would be required to lengthen the pitch period for frames P_1
and P_2 and decrease the pitch periods for frames P_4, P_5 and P_6. Also,
the given duration T of the phoneme will indicate how many pitch
periods should be inserted or deleted from the encoded phoneme to
achieve the desired duration period. Figs. 11 through 18 illustrate a
preferred implementation of such algorithms.
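As a worked illustration of the relation P_i = F_s/f_i, the C sketch below walks a linear pitch contour and lists the target length of each pitch period. The function name, the rounding to the nearest sample, and the choice to sample the contour at the start of each period are assumptions made for the example; the text does not prescribe them.

```c
#include <assert.h>

/* Fills periods[] with target lengths P_i = F_s / f_i for a pitch that
 * moves linearly from f_begin to f_end over duration_s seconds.
 * Returns the number of pitch periods needed to cover the duration. */
int plan_pitch_periods(double fs, double f_begin, double f_end,
                       double duration_s, int *periods, int max_periods)
{
    double t = 0.0;
    int count = 0;

    assert(fs > 0 && f_begin > 0 && f_end > 0 && duration_s > 0);
    while (t < duration_s && count < max_periods) {
        /* Desired frequency at the start of this period. */
        double f = f_begin + (f_end - f_begin) * (t / duration_s);
        int p = (int)(fs / f + 0.5);

        periods[count++] = p;
        t += p / fs;
    }
    return count;
}
```

With a hypothetical sampling frequency of 22000 Hz and the 220 Hz to 300 Hz example above, the first period is 100 samples and later periods shrink toward roughly 73 samples. Comparing each target with the stored period length tells the synthesizer whether to lengthen or shorten that frame, and comparing the returned count with the number of stored periods tells it how many to insert or delete.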
Fig. 11 illustrates an algorithm for increasing the pitch period,
with reference to the graphs of Fig. 12. The algorithm begins by
receiving a control to increase the pitch period to N + Δ, where N is the
pitch period of the encoded frame (block 350). In the next step, the
pitch period data is stored in a buffer x_n (block 351). x_n is shown in
Fig. 12 at the top of the page. In the next step, a left vector L_n is
generated by applying a weighting function WL to the pitch period data
x_n with reference to Δ (block 352). This weighting function is
illustrated in Equation 26 where M = N-Δ:

$$ L_n = x_n, \quad 0 \le n < \Delta $$

$$ L_n = x_n \cdot (N-n)/(M+1), \quad \Delta \le n < N $$

Equation 26

As can be seen in Fig. 12, the weighting function WL is constant from
the first sample to sample Δ, and decreases from Δ to N.

Next, a weighting function WR is applied to x_n (block 353) as can
be seen in Fig. 12. This weighting function is executed as shown
in Equation 27:

$$ R_n = x_n \cdot (n+1)/(M+1), \quad 0 \le n < N-\Delta $$

$$ R_n = x_n, \quad N-\Delta \le n < N $$

Equation 27

As can be seen in Fig. 12, the weighting function WR increases
from 0 to N-Δ and remains constant from N-Δ to N. The resulting
waveforms L_n and R_n are shown conceptually in Fig. 12. As can be
seen, L_n maintains the beginning of the sequence x_n, while R_n maintains
the ending of the data x_n.

The pitch modified sequence y_n is formed (block 354) by adding
the two sequences as shown in Equation 28:

$$ y_n = L_n + R_{(n-\Delta)} $$

Equation 28

This is graphically shown in Fig. 12 by placing R_n shifted by Δ below
L_n. The combination of L_n and R_n shifted by Δ is shown to be y_n at the
bottom of Fig. 12. The pitch period for y_n is N + Δ. The beginning of
y_n is the same as the beginning of x_n, and the ending of y_n is
substantially the same as the ending of x_n. This maintains continuity
with adjacent frames in the sequence, and accomplishes a smooth
transition while extending the pitch period of the data.

Equation 28 is executed with the assumption that L_n is 0 for
n >= N, and R_n is 0 for n < 0. This is illustrated pictorially in Fig. 12.
An efficient implementation of this scheme, which requires at most
one multiply per sample, is shown in Equation 29:

$$ y_n = x_n, \quad 0 \le n < \Delta $$

$$ y_n = x_n + \left[ x_{n-\Delta} - x_n \right] \cdot (n-\Delta+1)/(N-\Delta+1), \quad \Delta \le n < N $$

$$ y_n = x_{n-\Delta}, \quad N \le n < N_d $$

Equation 29

This results in a new frame having a pitch period of N_d = N + Δ.
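Equation 29 maps directly onto three loops. The C sketch below is an illustration with a hypothetical function name; it uses integer arithmetic, so the division truncates toward zero.

```c
#include <assert.h>

/* Lengthens one pitch period x (n samples) to n + delta samples in y,
 * following Equation 29. Requires 0 < delta < n. */
void increase_pitch_period(const short *x, int n, int delta, short *y)
{
    int i;

    assert(delta > 0 && delta < n);

    /* The beginning of the period is copied unchanged. */
    for (i = 0; i < delta; i++)
        y[i] = x[i];

    /* Cross-fade from x[i] toward the delayed copy x[i - delta]. */
    for (i = delta; i < n; i++) {
        long diff = (long)x[i - delta] - x[i];

        y[i] = (short)(x[i] + diff * (i - delta + 1) / (n - delta + 1));
    }

    /* The ending of the period is the delayed copy alone. */
    for (i = n; i < n + delta; i++)
        y[i] = x[i - delta];
}
```

For x = {0, 30, 60, 90} and delta = 2 this produces {0, 30, 40, 50, 60, 90}: the first and last samples of the original period are preserved while the middle is stretched.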

There are also instances in which the pitch period must be
decreased. The algorithm for decreasing the pitch period is shown in
Fig. 13 with reference to the graphs of Fig. 14. Thus, the algorithm
begins with a control signal indicating that the pitch period must be
decreased to N-Δ (block 400). The first step is to store two
consecutive pitch periods in the buffer x_n (block 401). Thus, the buffer
x_n as can be seen in Fig. 14 consists of two consecutive pitch periods,
with the period N_l being the length of the first pitch period, and N_r being
the length of the second pitch period. Next, two sequences L_n and R_n
are conceptually created using weighting functions WL and WR (blocks
402 and 403). The weighting function WL emphasizes the beginning of
the first pitch period, and the weighting function WR emphasizes the
ending of the second pitch period. These functions can be conceptually
represented as shown in Equations 30 and 31, respectively:
$$ L_n = x_n, \quad 0 \le n < N_l - W $$

$$ L_n = x_n \cdot (N_l - n)/(W+1), \quad N_l - W \le n < N_l $$

$$ L_n = 0 \quad \text{otherwise} $$

Equation 30

and

$$ R_n = x_n \cdot (n - N_l + W - \Delta + 1)/(W+1), \quad N_l - W + \Delta \le n < N_l + \Delta $$

$$ R_n = x_n, \quad N_l + \Delta \le n < N_l + N_r $$

$$ R_n = 0 \quad \text{otherwise} $$

Equation 31

In these equations, Δ is equal to the difference between N_l and
the desired pitch period N_d. The value W is equal to 2*Δ, unless 2*Δ
is greater than N_d, in which case W is equal to N_d.

These two sequences L_n and R_n are blended to form a pitch
modified sequence y_n (block 404). The length of the pitch modified
sequence y_n will be equal to the sum of the desired length and the
length of the right phoneme frame N_r. It is formed by adding the two
sequences as shown in Equation 32:

$$ y_n = L_n + R_{(n+\Delta)} $$

Equation 32

Thus, when a pitch period is decreased, two consecutive pitch
periods of data are affected, even though only the length of one pitch
period is changed. This is done because pitch periods are divided at
places where short-term energy is the lowest within a pitch period.
Thus, this strategy affects only the low energy portion of the pitch
periods. This minimizes the degradation in speech quality due to the
pitch modification. It should be appreciated that the drawings in Fig. 14
are simplified and do not represent actual pitch period data.
An efficient implementation of this scheme, which requires at
most one multiply per sample, is set out in Equations 33 and 34.
The first pitch period of length N_d is given by Equation 33:

$$ y_n = x_n, \quad 0 \le n < N_l - W $$

$$ y_n = x_n + \left[ x_{n+\Delta} - x_n \right] \cdot (n - N_l + W + 1)/(W+1), \quad N_l - W \le n < N_d $$

Equation 33

The second pitch period of length N_r is generated as shown in
Equation 34:

$$ y_n = x_{n-\Delta} + \left[ x_n - x_{n-\Delta} \right] \cdot (n - \Delta - N_l + W + 1)/(W+1), \quad N_l \le n < N_l + \Delta $$

$$ y_n = x_n, \quad N_l + \Delta \le n < N_l + N_r $$

Equation 34

As can be seen in Fig. 14, the sequence L_n is essentially equal to
the first pitch period until the point N_l-W. At that point, a decreasing
ramp WL is applied to the signal to dampen the effect of the first pitch
period.

As also can be seen, the weighting function WR begins at the
point N_l-W+Δ and applies an increasing ramp to the sequence x_n until
the point N_l+Δ. From that point, a constant value is applied. This has
the effect of damping the effect of the right sequence and emphasizing
the left during the beginning of the weighting functions, and generating
an ending segment which is substantially equal to the ending segment of
x_n, emphasizing the right sequence and damping the left. When the two
functions are blended, the resulting waveform y_n is substantially equal
to the beginning of x_n at the beginning of the sequence; at the point
N_l-W a modified sequence is generated until the point N_l. From N_l to the
ending, the sequence x_n shifted by Δ results.
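The shortening procedure can be sketched in C by evaluating y_n = L_n + R_(n+Δ) over the whole output, which covers both Equation 33 and Equation 34 when the second period is indexed in output coordinates. The function name is hypothetical, and the precondition that the second period is at least Δ samples long is an assumption added to keep all reads inside the buffer.

```c
#include <assert.h>

/* Shortens the first of two consecutive pitch periods from nl to nd
 * samples. x holds nl + nr samples; y receives nd + nr samples. */
void decrease_pitch_period(const short *x, int nl, int nr, int nd, short *y)
{
    int delta = nl - nd;
    int w = (2 * delta > nd) ? nd : 2 * delta;   /* blend window W */
    int n;

    assert(nd > 0 && delta > 0 && nr >= delta);

    /* Untouched beginning of the first period. */
    for (n = 0; n < nl - w; n++)
        y[n] = x[n];

    /* Cross-fade from x[n] toward the advanced copy x[n + delta]. */
    for (n = nl - w; n < nl; n++) {
        long diff = (long)x[n + delta] - x[n];

        y[n] = (short)(x[n] + diff * (n - nl + w + 1) / (w + 1));
    }

    /* Remainder of the second period, shifted earlier by delta. */
    for (n = nl; n < nd + nr; n++)
        y[n] = x[n + delta];
}
```

In the output, samples 0 to nd-1 form the shortened first period and the remaining nr samples form the second period.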
A need also arises for insertion of pitch periods to increase the
duration of a given sound. A pitch period is inserted according to the
algorithm shown in Fig. 15 with reference to the drawings of Fig. 16.

The algorithm begins by receiving a control signal to insert a pitch
period between frames L_n and R_n (block 450). Next, both L_n and R_n
are stored in the buffer (block 451), where L_n and R_n are two adjacent
pitch periods of a voiced diphone. (Without loss of generality, it is
assumed for the description that the two sequences are of equal lengths
N.)

In order to insert a pitch period x_n of the same duration, without
causing a discontinuity between L_n and x_n and between x_n and R_n, the
pitch period x_n should resemble R_n around n=0 (preserving L_n to x_n
continuity), and should resemble L_n around n=N (preserving x_n to R_n
continuity). This is accomplished by defining x_n as shown in Equation
35:

$$ x_n = R_n + (L_n - R_n) \cdot \left[ (n+1)/(N+1) \right], \quad 0 \le n \le N-1 $$

Equation 35

Conceptually, as shown in Fig. 15, the algorithm proceeds by
generating a left vector WL(L_n), essentially applying the increasing
ramp WL to the signal L_n (block 452).

A right vector WR(R_n) is generated using the weighting vector
WR (block 453) which is essentially a decreasing ramp as shown in Fig.
16. Thus, the ending of L_n is emphasized with the left vector, and the
beginning of R_n is emphasized with the vector WR.

Next, WL(L_n) and WR(R_n) are blended to create an inserted
period x_n (block 454).

The computation requirement for inserting a pitch period is thus
just a multiplication and two additions per speech sample.

Finally, concatenation of L_n, x_n and R_n produces a sequence with
an inserted pitch period (block 455).
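Equation 35 can be sketched in C as a single loop. The function name is hypothetical and both periods are assumed to have the same length, as in the description; the division by N + 1 is written out directly here and truncates toward zero.

```c
#include <assert.h>

/* Builds the period x to be inserted between adjacent periods l and r,
 * each n_len samples long, following Equation 35. */
void make_inserted_period(const short *l, const short *r, int n_len, short *x)
{
    int n;

    assert(n_len > 0);
    for (n = 0; n < n_len; n++) {
        long diff = (long)l[n] - r[n];

        /* Starts close to r[n] and ends close to l[n]. */
        x[n] = (short)(r[n] + diff * (n + 1) / (n_len + 1));
    }
}
```

The output string is then l, x, r in that order. Because x begins like r and ends like l, each junction joins the end of one period to something resembling the start of the next.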
Deletion of a pitch period is accomplished as shown in Fig. 17
with reference to the graphs of Fig. 18. This algorithm, which is very
similar to the algorithm for inserting a pitch period, begins with receiving
a control signal indicating deletion of pitch period R_n which follows L_n
(block 500). Next, the pitch periods L_n and R_n are stored in the buffer
(block 501). This is pictorially illustrated in Fig. 18 at the top of the
page. Again, without loss of generality, it is assumed that the two
sequences have equal lengths N.

The algorithm operates to modify the pitch period L_n which
precedes R_n (to be deleted) so that it resembles R_n as n approaches N.
This is done as set forth in Equation 36:

$$ L'_n = L_n + (R_n - L_n) \cdot \left[ (n+1)/(N+1) \right], \quad 0 \le n \le N-1 $$

Equation 36

In Equation 36, the resulting sequence L'_n is shown at the bottom of
Fig. 18. Conceptually, Equation 36 applies a weighting function WL to
the sequence L_n (block 502). This emphasizes the beginning of the
sequence L_n as shown. Next, a right vector WR(R_n) is generated by
applying a weighting vector WR to the sequence R_n that emphasizes the
ending of R_n (block 503).

WL(L_n) and WR(R_n) are blended to create the resulting vector
L'_n (block 504). Finally, the sequence L_n-R_n is replaced with the
sequence L'_n in the pitch period string (block 505).
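The deletion step, including the replacement of L_n-R_n by L'_n in the pitch period string, might look as follows in C. The layout of the string as a flat array of equal-length periods and the function name are assumptions made for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Deletes period k+1 from a string of `count` periods of n_len samples
 * each, after blending period k toward it per Equation 36.
 * Returns the new number of periods. */
int delete_next_period(short *periods, int count, int n_len, int k)
{
    short *l, *r;
    int n;

    assert(n_len > 0 && k >= 0 && k + 1 < count);
    l = periods + (long)k * n_len;
    r = l + n_len;

    /* L'[n] starts close to L[n] and ends close to R[n]. */
    for (n = 0; n < n_len; n++) {
        long diff = (long)r[n] - l[n];

        l[n] = (short)(l[n] + diff * (n + 1) / (n_len + 1));
    }

    /* Close the gap left by the deleted period. */
    memmove(r, r + n_len, (size_t)(count - k - 2) * n_len * sizeof(short));
    return count - 1;
}
```

After the call, the period that used to follow R_n directly follows L'_n, whose ending now resembles the ending of the deleted period.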

V. Conclusion

Accordingly, the present invention presents a software only
text-to-speech system which is efficient, uses a very small amount of
memory, and is portable to a wide variety of standard microcomputer
platforms. It takes advantage of knowledge about speech data to
create speech compression, blending, and duration control routines
which produce very high quality speech with very little computational
resources.

A source code listing of the software for executing the
compression and decompression, the blending, and the duration and
pitch control routines is provided in the Appendix as an example of a
preferred embodiment of the present invention.

The foregoing description of preferred embodiments of the present 
invention has been provided for the purposes of illustration and 
description. It is not intended to be exhaustive or to limit the invention 
to the precise forms disclosed. Obviously, many modifications and 
variations will be apparent to practitioners skilled in this art. The 
embodiments were chosen and described in order to best explain the 
principles of the invention and its practical application, thereby enabling 
others skilled in the art to understand the invention for various 
embodiments and with various modifications as are suited to the 
particular use contemplated. It is intended that the scope of the 
invention be defined by the following claims and their equivalents. 
APPENDIX

© APPLE COMPUTER, INC. 1993
37 C.F.R. § 1.96(a)

COMPUTER PROGRAM LISTINGS
TABLE OF CONTENTS

Section

I. ENCODER MODULE
II. DECODER MODULE
III. BLENDING MODULE
IV. INTONATION ADJUSTMENT MODULE
I. ENCODER MODULE

#include <stdio.h>
#include <math.h>
#include <StdLib.h>
#include <types.h>
#include <fcntl.h>
#include <string.h>

#include <types.h>
#include <files.h>
#include <resources.h>
#include <memory.h>
#include "vqcoder.h"

#define LAST_FRAME_FLAG 128

#define PBUF_SIZE 440

static float oc_state[2], nsf_state[NSF_ORDER + 1];
static short pstate[PORDER + 1], dstate[PORDER + 1];
static short AnaPbuf[PBUF_SIZE];
static short vsize, cbook_size, bs_size;

#pragma segment vqlib

/* Read Code Books */

float *EncodeBook[MAX_CBOOK_SIZE];
short *DecodeBook[MAX_CBOOK_SIZE];

get_cbook(short ratio)
{
    short *p;
    short frame_size, i;
    static short last_ratio = 0;
    Handle h;
    int skip;

    h = GetResource('CBOK', 1);
    HLock(h);
    p = (short *) *h;

    if (ratio == last_ratio)
        return;
    last_ratio = ratio;

    if (ratio < 3)
        return;

    if (NOMINAL_PITCH < 165)
        frame_size = 96;
    else
        frame_size = 160;

    get_compr_pars(ratio, frame_size, &vsize, &cbook_size, &bs_size);
    skip = 0;

    while (p[skip + 1] != vsize)
    {
        short t1, t2;
        t2 = p[skip];
        t1 = p[skip + 1];
        skip += sizeof(float) * (2 * t2 - 1) * (t1 + 1) / sizeof(short)
            + (2 * t2 * t1 + 2);
    }

    /* Skip Binary search tree */
    skip += sizeof(float) * (cbook_size - 1) * (vsize + 1) / sizeof(short)
        + (cbook_size * vsize + 2);

    /* Get pointers to Full search code books */
    for (i = 0; i < cbook_size; i++)
    {
        EncodeBook[i] = (float *) &p[skip];
        skip += (vsize + 1) * sizeof(float) / sizeof(short);
    }

    for (i = 0; i < cbook_size; i++)
    {
        DecodeBook[i] = p + skip;
        skip += vsize;
    }
}



char *getcbook(long *len, short ratio)
{
    get_cbook(ratio);
    *len = sizeof(short) * vsize * cbook_size;
    /* plus one is to make space at the end for the array of pointers */
    return (char *) DecodeBook[0];
}

/* A Routine for Pitch filter parameter Estimation */

GetPitchFilterPars(x, len, pbuf, min_pitch, max_pitch, pitch, beta)
float *beta;
short *x, *pbuf;
short min_pitch, max_pitch;
short len;
unsigned int *pitch;
{
    /* Estimate long-term predictor */
    int best_pitch, i, j;
    float syy, sxy, best_sxy = 0.0, best_syy = 1.0;
    short *ptr;

    best_pitch = min_pitch;
    ptr = pbuf + PBUF_SIZE - min_pitch;
    syy = 1.0;
    for (i = 0; i < len; i++)
    {
        syy += (*ptr) * (*ptr);
        ptr++;
    }

    for (j = min_pitch; j < max_pitch; j++)
    {
        sxy = 0.0;
        ptr = pbuf + PBUF_SIZE - j;
        for (i = 0; i < len; i++)
            sxy += x[i] * (*ptr++);
        if (sxy > 0 && (sxy * sxy * best_syy > best_sxy * best_sxy * syy))
        {
            best_syy = syy;
            best_sxy = sxy;
            best_pitch = j;
        }
        syy = syy - pbuf[PBUF_SIZE - j + len - 1] * pbuf[PBUF_SIZE - j + len - 1]
            + pbuf[PBUF_SIZE - j - 1] * pbuf[PBUF_SIZE - j - 1];
    }

    *pitch = best_pitch;
    *beta = best_sxy / best_syy;
}

/* Quantization of LTP gain parameter */
CodePitchFilterGain(beta, bcode)
float beta;
unsigned int *bcode;
{
    int i;

    for (i = 0; i < DLB_TAB_SIZE; i++)
    {
        if (beta <= dlb_tab[i])
            break;
    }
    *bcode = i;
}

/* Pitch filter */

PitchFilter(data, len, pbuf, pitch, ibeta)
float *data;
short ibeta;
short *pbuf;
short len;
unsigned int pitch;
{
    long pn;
    int i, j;

    j = PBUF_SIZE - pitch;
    for (i = 0; i < len; i++)
    {
        pn = ((ibeta * pbuf[j++]) >> 4);
        data[i] -= pn;
    }
}

/* Forward Noise Shaping filter */

FNSFilter(float *inp, float *state, short len, float *out)
{
    short i, j;

    for (j = 0; j < len; j++)
    {
        float tmp = inp[j];
        for (i = 1; i <= NSF_ORDER; i++)
            tmp += state[i] * nsf[i];
        out[j] = state[0] = tmp;
        for (i = NSF_ORDER; i > 0; i--)
            state[i] = state[i - 1];
    }
}

/* Update Noise shaping Filter states */
UpdateNSFState(float *inp, float *state, short len)
{
    short i, j;
    float temp_state[NSF_ORDER + 1];

    for (i = 0; i <= NSF_ORDER; i++)
        temp_state[i] = 0;

    for (j = 0; j < len; j++)
    {
        float tmp = inp[j];
        for (i = 1; i <= NSF_ORDER; i++)
            tmp += temp_state[i] * nsf[i];
        temp_state[0] = tmp;
        for (i = NSF_ORDER; i > 0; i--)
            temp_state[i] = temp_state[i - 1];
    }

    for (i = 0; i <= NSF_ORDER; i++)
        state[i] = state[i] - temp_state[i];
}

/* Quantization of Segment Power */
CodeBlockGain(power, gcode)
float power;
unsigned int *gcode;
{
    int i;

    for (i = 0; i < DLG_TAB_SIZE; i++)
    {
        if (power <= dlg_tab[i])
            break;
    }
    *gcode = i;
}

/* Full search Coder */

VQCoder(float *x, float *nsf_state, short len, struct frame *bs)
{
    float max_x, tmp;
    int i, j, k, index, lshift_count;
    unsigned int gcode;
    float min_err = 0;

    max_x = x[0];
    for (i = 1; i < len; i++)
        if (fabs(x[i]) > max_x)
            max_x = fabs(x[i]);

    CodeBlockGain(max_x, &gcode);
    max_x = qlg_tab[gcode];
    lshift_count = 7 - gcode;   /* To scale 14-bit Code book output to the
                                   16-bit actual value */
    bs->gcode = gcode;

    for (i = 0; i < len; i += vsize)
    {
        /* Filter the data vector */
        FNSFilter(&x[i], nsf_state, vsize, &x[i]);

        /* Scale data */
        for (j = i; j < i + vsize; j++)
            x[j] = x[j] * 1024 / max_x;

        index = 0;
        for (j = 0; j < cbook_size; j++)
        {
            tmp = EncodeBook[j][vsize] * 1024.0;
            for (k = 0; k < vsize; k++)
                tmp -= x[i + k] * EncodeBook[j][k];
            if (tmp < min_err || j == 0)
            {
                index = j;
                min_err = tmp;
            }
        }
        bs->vqcode[i / vsize] = index;

        /* Rescale data: Decoded data is 14-bits, convert to 16 bits */
        if (lshift_count)
        {
            for (k = 0; k < vsize; k++)
                x[i + k] = ((4 * DecodeBook[index][k]) >> lshift_count);
        }
        else
        {
            for (k = 0; k < vsize; k++)
                x[i + k] = 4 * DecodeBook[index][k];
        }

        /* Update noise shaping filter state */
        UpdateNSFState(&x[i], nsf_state, vsize);
    }
}

init_compress()
{
    int i;

    oc_state[0] = 0;
    oc_state[1] = 0;
    for (i = 0; i <= PORDER; i++)
        pstate[i] = dstate[i] = 0;
    for (i = 0; i < PBUF_SIZE; i++)
        AnaPbuf[i] = 0;
    for (i = 0; i <= NSF_ORDER; i++)
        nsf_state[i] = 0;
}

Encoder(xn, frame_size, min_pitch, max_pitch, bs)
short xn[];
struct frame *bs;
short frame_size, min_pitch, max_pitch;
{
    unsigned int pitch, bcode;
    float preemp_xn[PBUF_SIZE], beta;
    short xn_copy[PBUF_SIZE];
    short ibeta;
    float acc;
    int i, j;

    /* Offset Compensation */
    for (i = 0; i < frame_size; i++)
    {
        float inp = xn[i];
        xn[i] = inp - oc_state[0] + ALPHA * oc_state[1];
        oc_state[1] = xn[i];
        oc_state[0] = inp;
    }

    /* Linear Prediction Filtering */
    for (i = 0; i < frame_size; i++)
    {
        acc = pstate[0] = xn[i];
        for (j = 1; j <= PORDER; j++)
            acc -= pstate[j] * pfilt[j];
        xn_copy[i] = preemp_xn[i] = acc;
        for (j = PORDER; j > 0; j--)
            pstate[j] = pstate[j - 1];
    }

    GetPitchFilterPars(xn_copy, frame_size, AnaPbuf, min_pitch,
        max_pitch, &pitch, &beta);
    CodePitchFilterGain(beta, &bcode);
    ibeta = qlb_tab[bcode];
    bs->bcode = bcode;
    bs->pitch = pitch - min_pitch + 1;

    PitchFilter(preemp_xn, frame_size, AnaPbuf, pitch, ibeta);
    VQCoder(preemp_xn, nsf_state, frame_size, bs);

    /* Inverse Filtering */
    j = PBUF_SIZE - pitch;
    for (i = 0; i < frame_size; i++)
    {
        xn_copy[i] = preemp_xn[i];
        xn_copy[i] += ((ibeta * AnaPbuf[j++]) >> 4);
    }

    /* Update Pitch Buffer */
    j = 0;
    for (i = frame_size; i < PBUF_SIZE; i++)
        AnaPbuf[j++] = AnaPbuf[i];
    for (i = 0; i < frame_size; i++)
        AnaPbuf[j++] = xn_copy[i];

    /* Inverse LP filtering */
    for (i = 0; i < frame_size; i++)
    {
        acc = xn_copy[i];
        for (j = 1; j <= PORDER; j++)
            acc = acc + dstate[j] * pfilt[j];
        dstate[0] = acc;
        for (j = PORDER; j > 0; j--)
            dstate[j] = dstate[j - 1];
    }

    for (j = 0; j <= PORDER; j++)
        pstate[j] = dstate[j];
}

compress(short *input, short ilen, unsigned char *output, long *olen, long docomp)
{
    int i, j, vcount;
    unsigned char temp;
    short frame_size, min_pitch, max_pitch;

    if (docomp > 2)
    {
        init_compress();
        if (NOMINAL_PITCH < 165)
        {
            min_pitch = 96;
            frame_size = 96;
            max_pitch = 350;
        }
        else
        {
            min_pitch = 160;
            frame_size = 160;
            max_pitch = 414;
        }
        bs_size = frame_size / vsize + 2;

        /* TEMPORARY: Storing State information */
        pstate[1] = *(input - 1);
        if (pstate[1] > 0)
            pstate[1] = (pstate[1] + 128) / 256 + 128;
        else
            pstate[1] = (pstate[1] - 128) / 256 + 128;

        if (pstate[1] < 0)
            pstate[1] = 0;
        if (pstate[1] > 255)
            pstate[1] = 255;
        *output = pstate[1];
        pstate[1] = pstate[1] - 128;
        pstate[1] = 256 * pstate[1];
        dstate[1] = pstate[1];
        /* End of Hack */

        for (i = 0; i < ilen; i += frame_size)
        {
            Encoder(input + i, frame_size, min_pitch, max_pitch, output + j);
            j += bs_size;
        }
        j -= bs_size;

        /* Number of vectors in last frame */
        vcount = (ilen + frame_size - i + vsize - 1) / vsize;
        temp = output[j];
        output[j] = vcount + LAST_FRAME_FLAG;
        output[j + vcount + 2] = temp;
        *olen = j + vcount + 3;
    }
    else
    {
        static long SampCount = 0;
        copy(input, output, 2 * ilen);
        SampCount += ilen;
        *olen = ilen;
    }
}

copy(a, b, len)
short *a, *b;
short len;
{
    int i;

    for (i = 0; i < len; i++)
        *b++ = (*a++);
}



WO 94/17517 



PCT/US94/00770 






II. DECODER MODULE

#include <Types.h>
#include <Memory.h>
#include <Quickdraw.h>
#include <ToolUtils.h>
#include <errors.h>
#include <files.h>

#include "vtcint.h"
#include <stdlib.h>
#include <math.h>
#include <sysequ.h>
#include <string.h>

#define MAX_CBOOK_SIZE   256
#define LAST_FRAME_FLAG  128
#define PORDER           1
#define IPCONS           7      /* 7/8 */

#define LARGE_NUM        100000000
#define VOICED           1

#define LEFT             0
#define RIGHT            1
#define UNVOICED         0

#define PFILT_ORDER      8

struct frame {
    unsigned gcode : 4;
    unsigned bcode : 4;
    unsigned pitch : 8;
    unsigned char vqcode[];
};

void expand(short **DecodeBook, short frame_size, short vsize,
    short min_pitch, struct frame *bs, short *output, short smpnum);

get_compr_pars(short ratio, short frame_size, short *vsize, 
short *cbook_size, short *bs_size) 

{ 

switch (ratio) 

{ 

case 4: 

*vsize = 2; 
*cbook_size = 256; 
*bs_size = frame_size/2 + 2; 
break; 
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case 7: 

*vsize = 4; 

*cbook_size = 256; 

*bs_size = frame_size/4 + 2; 

break; 
case 14: 

*vsize = 8; 

*cbook_size = 256; 

*bs_size = frame_size/8 + 2; 

break; 
case 24; 

*vsize = 16; 

*cbook_size = 256; 

# bs_size = frame_size/1 6 + 2; 

break; 
default: 

*vsize = 2; 

# cbook_size = 256; 

*bs__size = frame_size/2 + 2; 

break; 

} 

} 
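
The parameter table above can be read as a simple size rule. The sketch below is illustrative only and is not part of the listing: the helper names frame_bytes and vector_size_for_ratio are hypothetical, and the two-byte header is an assumption drawn from struct frame (one byte for the two 4-bit codes, one pitch byte), followed by one 8-bit codebook index per vector.

```c
#include <assert.h>

/* Hypothetical helper (not in the listing): bytes per encoded frame.
   Assumes one byte for gcode/bcode, one pitch byte, and one 8-bit
   codebook index for each vector of vsize samples. */
static int frame_bytes(int frame_size, int vsize)
{
    assert(vsize > 0 && frame_size % vsize == 0);
    return 2 + frame_size / vsize;
}

/* Vector size chosen by the listing's switch for each nominal ratio. */
static int vector_size_for_ratio(int ratio)
{
    switch (ratio) {
    case 7:  return 4;
    case 14: return 8;
    case 24: return 16;
    default: return 2;   /* covers ratio 4 and any unknown value */
    }
}
```

Under these assumptions a 64-sample frame at ratio 7 occupies 18 bytes, which agrees with bs_size = frame_size/4 + 2 in the listing.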

short *SnInit(short comp_ratio)
{
    short *state, *ptr;
    int i;

    state = ptr = (short *)NewPtr((PFILT_ORDER + 1 + PFILT_ORDER/2 + 2) *
                                  sizeof(short));
    if (state == nil)
    {
        return nil;
    }

    for (i = 0; i < PFILT_ORDER + 1; i++)
        *ptr++ = 0;

    /*
    if (comp_ratio == 24)
    {
        *ptr++ = 0.036953 * 32768 + 0.5;
        *ptr++ = -0.132232 * 32768 - 0.5;
        *ptr++ = 0.047798 * 32768 + 0.5;
        *ptr++ = 0.403220 * 32768 + 0.5;
        *ptr++ = 0.290033 * 32768 + 0.5;
    }
    else
    {
        *ptr++ = 0.074539 * 32768 + 0.5;
        *ptr++ = -0.174290 * 32768 - 0.5;
        *ptr++ = 0.013704 * 32768 + 0.5;
        *ptr++ = 0.426815 * 32768 + 0.5;
        *ptr++ = 0.320707 * 32768 + 0.5;
    }
    */

    if (comp_ratio == 24)
    {
        *ptr++ = 1211;
        *ptr++ = -4333;
        *ptr++ = 1566;
        *ptr++ = 13213;
        *ptr++ = 9504;
    }
    else
    {
        *ptr++ = 2442;
        *ptr++ = -5711;
        *ptr++ = 449;
        *ptr++ = 13986;
        *ptr++ = 10509;
    }

    *ptr = 0;   /* DC value */

    return state;
}
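
The integer filter taps in SnInit are the commented-out floating-point coefficients scaled by 32768 and rounded. A minimal sketch of that conversion follows; the helper name to_q15 is hypothetical and does not appear in the listing.

```c
#include <assert.h>

/* Hypothetical helper: round a coefficient in (-1, 1) to 16-bit fixed
   point with 15 fractional bits, mirroring the "x * 32768 +/- 0.5"
   expressions in the commented-out block of SnInit. */
static short to_q15(double x)
{
    assert(x > -1.0 && x < 1.0);
    return (short)(x * 32768.0 + (x < 0 ? -0.5 : 0.5));
}
```

For example, to_q15(0.036953) gives 1211 and to_q15(-0.132232) gives -4333, the first two taps used when comp_ratio is 24.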

SnDone(char *state)
{
    if (state != nil)
    {
        DisposPtr(state);
    }
}

short **SnDeInit(p, ratio, frame_size)
short *p, ratio, frame_size;
{
    int i;
    short cbook_size = 256, vsize = 16, bs_size;
    short **DecodeBook;

    get_compr_pars(ratio, frame_size, &vsize, &cbook_size, &bs_size);

    DecodeBook = (short **)NewPtr(cbook_size * sizeof(short *));
    if (DecodeBook) {
        for (i = 0; i < cbook_size; i++)
        {
            DecodeBook[i] = p;
            p += vsize;
        }
    }

    return DecodeBook;
}

SnDeDone(char *DecodeBook)
{
    if (DecodeBook != nil)
    {
        DisposPtr(DecodeBook);
    }
}

void
expand(short **DecodeBook, short frame_size, short vsize,
       short min_pitch, struct frame *bs, short *output, short smpnum)
{
    short count;
    short *bptr, *sptr1, *sptr2;
    unsigned short pitch, bcode;

    /*
    short qlb_tab[] = {
        1, 2, 3, 4, 5, 6, 7, 8,
        9, 10, 11, 12, 13, 14, 15, 16
    };
    */

    bcode = bs->bcode;
    pitch = bs->pitch + min_pitch - 1;

    /* Decode VQ vectors */
    {
        unsigned char *cptr;
        short k, vsize_by_2;
        short rshift_count = 7 - bs->gcode; /* We want the output to be 14-bit
                                               number */

        sptr1 = output + smpnum;
        cptr = bs->vqcode;
        vsize_by_2 = (vsize >> 1) + 1; /* +1 since we do a while (--i) instead of
                                          while (i--) */

        if (rshift_count)
        {
            for (k = 0; k < frame_size; k += vsize)
            {
                bptr = DecodeBook[*cptr++];
                count = vsize_by_2;
                while (--count)
                {
                    *sptr1++ = ((*bptr++) >> rshift_count);
                    *sptr1++ = ((*bptr++) >> rshift_count);
                }
            }
        }
        else
        {
            for (k = 0; k < frame_size; k += vsize)
            {
                bptr = DecodeBook[*cptr++];
                count = vsize_by_2;
                while (--count)
                {
                    *sptr1++ = *bptr++;
                    *sptr1++ = *bptr++;
                }
            }
        }
    }


    /* Inverse Filtering */
    if (smpnum < pitch)
    {
        sptr1 = output + pitch;
        count = smpnum + frame_size + 1 - pitch; /* +1 since we do a while (--i)
                                                    instead of while (i--) */
        sptr2 = sptr1 - pitch;
        switch (bcode)
        {
        case 0:
            while (--count)
                *sptr1++ += ((*sptr2++) >> 4);
            break;
        case 1:
            while (--count)
                *sptr1++ += ((*sptr2++) >> 3);
            break;
        case 2:
            while (--count)
                *sptr1++ += ((3 * (*sptr2++)) >> 4);
            break;
        case 3:
            while (--count)
                *sptr1++ += ((*sptr2++) >> 2);
            break;
        case 4:
            while (--count)
                *sptr1++ += ((5 * (*sptr2++)) >> 4);
            break;
        case 5:
            while (--count)
                *sptr1++ += ((3 * (*sptr2++)) >> 3);
            break;
        case 6:
            while (--count)
                *sptr1++ += ((7 * (*sptr2++)) >> 4);
            break;
        case 7:
            while (--count)
                *sptr1++ += ((*sptr2++) >> 1);
            break;
        case 8:
            while (--count)
            {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 3) + tmp) >> 4);
            }
            break;
        case 9:
            while (--count)
                *sptr1++ += ((5 * (*sptr2++)) >> 3);
            break;
        case 10:
            while (--count)
            {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 3) + 3 * tmp) >> 4);
            }
            break;
        case 11:
            while (--count)
                *sptr1++ += ((3 * (*sptr2++)) >> 2);
            break;
        case 12:
            while (--count)
            {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - 3 * tmp) >> 4);
            }
            break;
        case 13:
            while (--count)
                *sptr1++ += ((7 * (*sptr2++)) >> 3);
            break;
        case 14:
            while (--count)
            {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - tmp) >> 4);
            }
            break;
        case 15:
            while (--count)
                *sptr1++ += *sptr2++;
            break;
        }

    } else {
        sptr1 = output + smpnum;
        sptr2 = sptr1 - pitch;
        count = (frame_size / 4) + 1;
        switch (bcode)
        {
        case 0:
            while (--count) {
                *sptr1++ += ((*sptr2++) >> 4);
                *sptr1++ += ((*sptr2++) >> 4);
                *sptr1++ += ((*sptr2++) >> 4);
                *sptr1++ += ((*sptr2++) >> 4);
            }
            break;
        case 1:
            while (--count) {
                *sptr1++ += ((*sptr2++) >> 3);
                *sptr1++ += ((*sptr2++) >> 3);
                *sptr1++ += ((*sptr2++) >> 3);
                *sptr1++ += ((*sptr2++) >> 3);
            }
            break;
        case 2:
            while (--count) {
                *sptr1++ += ((3 * (*sptr2++)) >> 4);
                *sptr1++ += ((3 * (*sptr2++)) >> 4);
                *sptr1++ += ((3 * (*sptr2++)) >> 4);
                *sptr1++ += ((3 * (*sptr2++)) >> 4);
            }
            break;
        case 3:
            while (--count) {
                *sptr1++ += ((*sptr2++) >> 2);
                *sptr1++ += ((*sptr2++) >> 2);
                *sptr1++ += ((*sptr2++) >> 2);
                *sptr1++ += ((*sptr2++) >> 2);
            }
            break;
        case 4:
            while (--count) {
                *sptr1++ += ((5 * (*sptr2++)) >> 4);
                *sptr1++ += ((5 * (*sptr2++)) >> 4);
                *sptr1++ += ((5 * (*sptr2++)) >> 4);
                *sptr1++ += ((5 * (*sptr2++)) >> 4);
            }
            break;
        case 5:
            while (--count) {
                *sptr1++ += ((3 * (*sptr2++)) >> 3);
                *sptr1++ += ((3 * (*sptr2++)) >> 3);
                *sptr1++ += ((3 * (*sptr2++)) >> 3);
                *sptr1++ += ((3 * (*sptr2++)) >> 3);
            }
            break;
        case 6:
            while (--count) {
                *sptr1++ += ((7 * (*sptr2++)) >> 4);
                *sptr1++ += ((7 * (*sptr2++)) >> 4);
                *sptr1++ += ((7 * (*sptr2++)) >> 4);
                *sptr1++ += ((7 * (*sptr2++)) >> 4);
            }
            break;
        case 7:
            while (--count) {
                *sptr1++ += ((*sptr2++) >> 1);
                *sptr1++ += ((*sptr2++) >> 1);
                *sptr1++ += ((*sptr2++) >> 1);
                *sptr1++ += ((*sptr2++) >> 1);
            }
            break;
        case 8:
            while (--count) {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += ((8 * tmp + tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += ((8 * tmp + tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += ((8 * tmp + tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += ((8 * tmp + tmp) >> 4);
            }
            break;
        case 9:
            while (--count) {
                *sptr1++ += ((5 * (*sptr2++)) >> 3);
                *sptr1++ += ((5 * (*sptr2++)) >> 3);
                *sptr1++ += ((5 * (*sptr2++)) >> 3);
                *sptr1++ += ((5 * (*sptr2++)) >> 3);
            }
            break;
        case 10:
            while (--count) {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 3) + 3 * tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 3) + 3 * tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 3) + 3 * tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 3) + 3 * tmp) >> 4);
            }
            break;
        case 11:
            while (--count) {
                *sptr1++ += ((3 * (*sptr2++)) >> 2);
                *sptr1++ += ((3 * (*sptr2++)) >> 2);
                *sptr1++ += ((3 * (*sptr2++)) >> 2);
                *sptr1++ += ((3 * (*sptr2++)) >> 2);
            }
            break;
        case 12:
            while (--count) {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - 3 * tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - 3 * tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - 3 * tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - 3 * tmp) >> 4);
            }
            break;
        case 13:
            while (--count) {
                *sptr1++ += ((7 * (*sptr2++)) >> 3);
                *sptr1++ += ((7 * (*sptr2++)) >> 3);
                *sptr1++ += ((7 * (*sptr2++)) >> 3);
                *sptr1++ += ((7 * (*sptr2++)) >> 3);
            }
            break;
        case 14:
            while (--count) {
                long tmp;
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - tmp) >> 4);
                tmp = *sptr2++;
                *sptr1++ += (((tmp << 4) - tmp) >> 4);
            }
            break;
        case 15:
            while (--count) {
                *sptr1++ += *sptr2++;
                *sptr1++ += *sptr2++;
                *sptr1++ += *sptr2++;
                *sptr1++ += *sptr2++;
            }
            break;
        }
    }
}
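
The sixteen cases of the inverse pitch filter all apply a gain of (bcode + 1)/16 to the sample one pitch period back; the listing unrolls them so that each gain uses only shifts and small multiplies. The following compact sketch shows the same arithmetic in one loop. It is an illustration under stated assumptions, not the listing's code: the name inverse_pitch_filter and its index-based interface are hypothetical, and it assumes start >= pitch and an arithmetic right shift for negative values.

```c
#include <assert.h>

/* Sketch: add the pitch-predictor contribution back into a decoded frame.
   buf[start .. start+n-1] holds the decoded residual; earlier samples hold
   output that has already been reconstructed. */
static void inverse_pitch_filter(short *buf, int start, int n,
                                 int pitch, int bcode)
{
    int k;
    assert(pitch > 0 && start >= pitch);
    assert(bcode >= 0 && bcode <= 15);
    for (k = start; k < start + n; k++) {
        long past = buf[k - pitch];           /* sample one period back */
        buf[k] += (short)(((bcode + 1) * past) >> 4);
    }
}
```

With bcode = 7 the gain is 8/16, so each sample receives half of the sample one pitch period earlier, as in case 7 of the listing.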

short SnDecompress(DecodeBook, ratio, frame_size, min_pitch, bstream, output)
short **DecodeBook, ratio;
unsigned char *bstream;
short *output, frame_size, min_pitch;
{
    short count, SampCount;
    register short dstate;
    short vcount;
    short vsize, cbook_size, bs_size;

    get_compr_pars(ratio, frame_size, &vsize, &cbook_size, &bs_size);

    dstate = *bstream++;
    dstate = (dstate - 128) << 6;

    SampCount = 0;

    while ((*bstream & LAST_FRAME_FLAG) == 0)
    {
        expand(DecodeBook, frame_size, vsize, min_pitch,
               (struct frame *)bstream, output, SampCount);
        bstream += bs_size;
        SampCount += frame_size;
    }

    vcount = *bstream - LAST_FRAME_FLAG;

    *bstream = *(bstream + 2 + vcount);
    expand(DecodeBook, frame_size, vsize, min_pitch,
           (struct frame *)bstream, output, SampCount);

    *bstream = vcount + LAST_FRAME_FLAG;
    SampCount += vcount * vsize;

    count = (SampCount >> 1) + 1;
    while (--count) {
        *output++ = dstate = ((IPCONS * dstate) >> 3) + *output;
        *output++ = dstate = ((IPCONS * dstate) >> 3) + *output;
    }
    output -= SampCount;

    return SampCount;
}
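
The final loop of SnDecompress is a first-order inverse linear predictive filter, y[n] = x[n] + (7/8) y[n-1], with IPCONS = 7 and a shift by 3. The sketch below restates it with the read and the write of each sample in separate statements, because the single-expression form in the listing both increments output and reads through it. The function name inverse_lpc and its return value are assumptions made for illustration.

```c
#include <assert.h>

#define IPCONS 7   /* predictor coefficient 7/8, as in the listing */

/* Sketch: y[n] = x[n] + (7/8) * y[n-1], applied in place.
   prev is the initial predictor state; the final state is returned. */
static short inverse_lpc(short *x, int n, short prev)
{
    int k;
    assert(n >= 0);
    for (k = 0; k < n; k++) {
        prev = (short)(((IPCONS * (long)prev) >> 3) + x[k]);
        x[k] = prev;
    }
    return prev;
}
```

Starting from a zero state, the input 8, 0, 0, 64 becomes 8, 7, 6, 69: each output decays by 7/8 (with truncation) before the next input is added.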

#define FILTER state + PFILT_ORDER + 1
#define DC_VAL state + PFILT_ORDER + PFILT_ORDER/2 + 2
void SnSampExpandFilt(short *src, short off, short len,
                      char *dest, short *state)
{
    short input, temp;
    long acc;
    register short dc = *(DC_VAL);
    register short *sptr1, *sptr2;

    src += off;
    len++;
    sptr1 = state;
    sptr2 = state + PFILT_ORDER;
    while (--len) {
        input = *src++ - dc;
        dc += input >> 5;

        temp = input + *sptr1++;        /* (state[0] + state[8]) * filter[0] */
        acc = temp * *(FILTER);
        temp = *--sptr2 + *sptr1++;     /* (state[1] + state[7]) * filter[1] */
        acc += temp * *(FILTER + 1);
        temp = *--sptr2 + *sptr1++;     /* (state[2] + state[6]) * filter[2] */
        acc += temp * *(FILTER + 2);
        temp = *--sptr2 + *sptr1++;     /* (state[3] + state[5]) * filter[3] */
        acc += temp * *(FILTER + 3);
        acc += *sptr1 * *(FILTER + 4);  /* state[4] * filter[4] */

        if (acc > 0)
        {
            temp = (acc + (257 << 20)) >> 21;
            if (temp > 255)
                temp = 255;
        }
        else
        {
            temp = (acc + (255 << 20)) >> 21;
            if (temp < 0)
                temp = 0;
        }

        *dest++ = temp;

        sptr1 -= 4;
        sptr2 -= 4;

        *sptr1++ = *sptr2++;    /* state[0] = state[1] */
        *sptr1++ = *sptr2++;    /* state[1] = state[2] */
        *sptr1++ = *sptr2++;    /* state[2] = state[3] */
        *sptr1++ = *sptr2++;    /* state[3] = state[4] */
        *sptr1++ = *sptr2++;    /* state[4] = state[5] */
        *sptr1++ = *sptr2++;    /* state[5] = state[6] */
        *sptr1++ = *sptr2++;    /* state[6] = state[7] */
        *sptr1 = input;         /* state[7] = input */
        sptr1 -= 7;
    }

    *(DC_VAL) = dc;
}
III. BLENDING MODULE

/* A module for blending two diphones */

typedef struct {
    short lptr, pitch;
    short weight, weight_inc;
} bstate;

void SnBlend(pitchp lp, pitchp rp, short cur_tot, short tot,
             short type, bstate *bs)
{
#pragma unused (tot)
    short count;
    short *ptr1, *ptr2;

    if (type == VOICED)
    {
        if (cur_tot)
            return;

        {
            short weight;
            long min_amdf;
            short best_lag = 0, lag;
            short window_size;
            short weight_inc;

            /* First replicate the left pitch period */
            ptr1 = lp->bufp;
            ptr2 = ptr1 + lp->olen;
            count = lp->olen + 1;
            while (--count)
                *ptr2++ = *ptr1++;

            /* Smooth the discontinuity */
            {
                register short en, e2;

                en = lp->bufp[2] +
                     3 * (lp->bufp[0] - lp->bufp[1]) - lp->bufp[lp->olen - 1];
                e2 = lp->bufp[0] - lp->bufp[lp->olen - 1];

                if (en * en > e2 * e2)
                    en = e2;
                ptr2 = lp->bufp + lp->olen;
                count = (lp->olen >> 1) + 1;
                while (--count)
                {
                    *--ptr2 += en;
                    en = (((en << 4) - en) >> 4);
                }
            }

            min_amdf = LARGE_NUM;
            window_size = rp->olen;
            if (lp->olen < rp->olen)
                window_size = lp->olen;

            lag = rp->olen;
            while (--lag)
            {
                long amdf = 0;

                ptr1 = rp->bufp;
                ptr2 = lp->bufp + lag;
                count = ((window_size + 3) >> 2) + 1;
                while (--count)
                {
                    short tmp;

                    tmp = (*ptr1 - *ptr2);
                    if (tmp > 0)
                        amdf += tmp;
                    else
                        amdf -= tmp;
                    ptr1 += 4;
                    ptr2 += 4;
                }

                if (amdf < min_amdf)
                {
                    best_lag = lag;
                    min_amdf = amdf;
                }
            }

            bs->pitch = lp->olen;

            /* Update left buffer */
            if (best_lag < (lp->olen >> 1))
            {
                /* Add best_lag samples to the length of left pulse */
                lp->olen += best_lag;
            }
            else
            {
                /* Delete a few samples from the left pulse */
                lp->olen = best_lag;
            }

            bs->lptr = best_lag;
            weight_inc = 32767 / window_size;
            weight = 32767 - weight_inc;
            ptr1 = rp->bufp;
            ptr2 = lp->bufp + bs->lptr;
            count = window_size + 1;
            while (--count)
            {
                *ptr1++ += (((short) (*ptr2++ - *ptr1) * weight) >> 15);
                weight -= weight_inc;
            }
        }
    }
    else
    {
        register short delta;

        /* Just blend 15 samples */
        ptr2 = lp->bufp + lp->olen - 15;
        ptr1 = rp->bufp;

        /*
        for (i = 1; i < 16; i++)
        {
            *ptr1 = *ptr2 + (i * (*ptr1 - *ptr2)) >> 4;
            ptr1++;
            ptr2++;
        }
        */

        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + (delta >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((delta) >> 3);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((3 * delta) >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + (delta >> 2);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((5 * delta) >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((3 * delta) >> 3);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((7 * delta) >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + (delta >> 1);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + (((delta << 3) + delta) >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((5 * delta) >> 3);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + (((delta << 3) + 3 * delta) >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((3 * delta) >> 2);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + (((delta << 4) - 3 * delta) >> 4);
        delta = *ptr1 - *ptr2;
        *ptr1++ = *ptr2++ + ((7 * delta) >> 3);
        delta = *ptr1 - *ptr2;
        *ptr1 = *ptr2 + (((delta << 4) - delta) >> 4);

        lp->olen -= 15;
    }
}
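
The voiced branch of SnBlend chooses the blend point by minimizing an average magnitude difference function (AMDF) between the right frame and shifted windows of the extended left frame. A minimal sketch of that search is given below. It is not the listing's routine: the name best_blend_lag is hypothetical, it compares every sample rather than every fourth one, and it scans lags upward from zero, whereas the listing counts lag down and never tests lag 0.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: return the lag in [0, max_lag) at which right best matches
   ext (the extended left frame) under a summed absolute difference.
   Assumes ext holds at least max_lag + window - 1 samples. */
static int best_blend_lag(const short *ext, const short *right,
                          int window, int max_lag)
{
    long best = -1;
    int best_lag = 0, lag, k;
    assert(window > 0 && max_lag > 0);
    for (lag = 0; lag < max_lag; lag++) {
        long amdf = 0;
        for (k = 0; k < window; k++)
            amdf += labs((long)right[k] - (long)ext[lag + k]);
        if (best < 0 || amdf < best) {   /* keep the first minimum found */
            best = amdf;
            best_lag = lag;
        }
    }
    return best_lag;
}
```

The sum is not divided by the window length because the window is the same for every lag, so the minimum falls at the same place either way.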
IV. INTONATION ADJUSTMENT MODULE

/* A module for deleting a pitch period */
/*
   Pointer src1 points to Left Pitch period
   Pointer src2 points to Right Pitch period
   Pointer dst points to Resulting Pitch period
   len = length of the pitch periods
*/

skip_pulses(short *src1, short *src2, short *dst, short len)
{
    short i;
    register short weight, cweight;

    i = len + 1;
    weight = cweight = 32767 / i;
    while (--i)
    {
        *dst++ = *src1++ + (((short) (*src2++ - *src1) * cweight) >> 15);
        cweight += weight;
    }
}
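
skip_pulses replaces two adjacent pitch periods with one by a linear cross-fade whose weight moves from the left period towards the right period. The sketch below restates that idea with array indexing, so that no pointer is incremented and read in the same expression (the single-expression form in the listing relies on an evaluation order that standard C does not define). The name crossfade_periods and the const-qualified interface are assumptions for illustration.

```c
#include <assert.h>

/* Sketch: merge two equal-length pitch periods into one by a linear
   cross-fade. The weight on right starts near 1/(len+1) and rises by the
   same step for each sample; weights are 16-bit fractions of 32768. */
static void crossfade_periods(const short *left, const short *right,
                              short *dst, int len)
{
    long step, w;
    int k;
    assert(len > 0);
    step = 32767L / (len + 1);   /* weight increment per sample */
    w = step;
    for (k = 0; k < len; k++) {
        long diff = (long)right[k] - (long)left[k];
        dst[k] = (short)(left[k] + ((diff * w) >> 15));
        w += step;
    }
}
```

For a left period of zeros and a right period of constant 1000 with len = 3, the result is 249, 499, 749: roughly one quarter, one half and three quarters of the way from left to right.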

/* A module for inserting a pitch period */
/*
   Locn buffer[curbeg] points to Left Pitch period
   Locn buffer[curbeg + curlen] points to Right Pitch period
   Pointer dst points to Resulting Pitch period
   curlen = length of the pitch periods
*/

insert_pulse(short *buffer, short *dst, short curlen, short curbeg)
{
    short weight, cweight, count;
    short *src1, *src2;

    src1 = buffer + curbeg;
    src2 = buffer + curbeg + curlen;
    weight = 32767 / curlen;
    cweight = weight;
    count = curlen + 1;
    while (--count)
    {
        *dst++ = *src1++ = *src2++ + (((short) (*src1 - *src2) * cweight) >>
                                      15);
        cweight += weight;
    }
}



/* This module is used to change pitch information in the concatenated speech */
// This routine depends on the desired length (deslen) being at least half
// and no more than twice the actual length (len).

void SnChangePitch(short *buf, short *next, short len, short deslen, short lvoc,
                   short rvoc, short dosmooth)
{
#pragma unused(rvoc, dosmooth)
    short delta;
    short count;
    short *bptr, *aptr;
    short weight, weight_inc;

    if (!lvoc || (deslen == len)) return;

    if (deslen > len)
    {
        /* Increase Pitch period */
        delta = deslen - len;
        bptr = buf + len;
        aptr = buf + deslen;
        count = delta + 1;
        while (--count)
            *--aptr = *--bptr;

        count = len - delta + 1;
        weight = weight_inc = 32767 / count;
        while (--count)
        {
            register short tmp2;

            tmp2 = (*--aptr - *--bptr);
            *aptr = *bptr + ((tmp2 * weight) >> 15);
            weight += weight_inc;
        }

        return;
    }
    {
        /* Shorten Pitch Period */
        short wsize;

        delta = len - deslen;
        wsize = 2 * delta;
        if (wsize > deslen)
            wsize = deslen;

        weight_inc = 32767 / (wsize + 1);
        weight = weight_inc;
        aptr = buf + deslen;
        bptr = buf + len - wsize;
        count = wsize - delta + 1;
        while (--count)
        {
            *bptr++ += (((short) (*aptr++ - *bptr) * weight) >> 15);
            weight += weight_inc;
        }

        aptr = buf + deslen;
        bptr = next;
        count = delta + 1;
        weight = 32767 - weight;
        while (--count)
        {
            *bptr++ += (((short) (*aptr++ - *bptr) * weight) >> 15);
            weight -= weight_inc;
        }
    }
}
CLAIMS

What is claimed is:

1. An apparatus for concatenating a first digital frame of N samples
having respective magnitudes representing a first quasi-periodic
waveform and a second digital frame of M samples having respective
magnitudes representing a second quasi-periodic waveform, comprising:
    a buffer store to store the samples of first and second digital
    frames;
    means, coupled to the buffer store, for determining a blend point
    for the first and second digital frames in response to magnitudes of
    samples in the first and second digital frames;
    blending means, coupled with the buffer store and the means for
    determining, for computing a digital sequence representing a
    concatenation of the first and second quasi-periodic waveforms in
    response to the first frame, the second frame and the blend point.

2. The apparatus of claim 1, further including:
    transducer means, coupled to the means for computing, for
    transducing the digital sequence to an analog concatenated
    waveform.

3. The apparatus of claim 1, wherein the means for determining
includes:
    first means for computing an extended frame in response to the
    first digital frame;
    second means for finding a subset of the extended frame which
    matches the second digital frame relatively well, and defining the
    blend point as a sample in the subset.

4. The apparatus of claim 3, wherein the extended frame comprises a
concatenation of the first digital frame with a replica of the first
digital frame.

5. The apparatus of claim 3, wherein the subset of the extended frame
which matches the second digital frame relatively well comprises a
subset with a minimum average magnitude difference over the samples in
the subset, and the blend point comprises a first sample in the subset.

6. The apparatus of claim 1, wherein the means for determining
includes:
    first means for computing an extended frame comprising a
    discontinuity-smoothed concatenation of the first digital frame with
    a replica of the first digital frame;
    second means for finding a subset of the extended frame with a
    minimum average magnitude difference between the samples in the
    subset and the second digital frame, and defining the blend point as
    a first sample in the subset.

7. The apparatus of claim 1, wherein the blending means includes:
    means for supplying a first set of samples derived from the first
    digital frame and the blend point as a first segment of the digital
    sequence; and
    means for combining the second digital frame with a second set of
    samples derived from the first digital frame and the blend point,
    with emphasis on the second set in a starting sample and emphasis on
    the second digital frame in an ending sample to produce a second
    segment of the digital sequence.

8. The apparatus of claim 1, wherein the means for determining
includes:
    first means for computing an extended frame comprising a
    discontinuity-smoothed concatenation of the first digital frame with
    a replica of the first digital frame;
    second means for finding a subset of the extended frame with a
    minimum average magnitude difference between the samples in the
    subset and the second digital frame, and defining the blend point as
    a first sample in the subset; and
wherein the blending means includes:
    means for supplying a first set of samples derived from the first
    digital frame and the blend point as a first segment of the digital
    sequence; and
    means for combining the second digital frame with the subset of the
    extended frame, with emphasis on the subset of the extended frame in
    a starting sample and emphasis on the second digital frame in an
    ending sample to produce a second segment of the digital sequence.

9. The apparatus of claim 8, wherein the first and second digital
frames represent endings and beginnings respectively of adjacent
diphones in speech, and further including:
    transducer means, coupled to the blending means, for transducing
    the digital sequence to a sound in speech synthesis.

10. An apparatus for concatenating a first digital frame of N samples
having respective magnitudes representing a first sound segment and a
second digital frame of M samples having respective magnitudes
representing a second sound segment, comprising:
    a buffer store to store the samples of first and second digital
    frames;
    means, coupled to the buffer store, for determining a blend point
    for the first and second digital frames in response to magnitudes of
    samples in the first and second digital frames;
    blending means, coupled with the buffer store and the means for
    determining, for computing a digital sequence representing a
    concatenation of the first and second quasi-periodic waveforms in
    response to the first frame, the second frame and the blend point;
    and
    transducer means, coupled to the blending means, for transducing
    the digital sequence to sound.

11. The apparatus of claim 10, wherein the means for determining
includes:
    first means for computing an extended frame in response to the
    first digital frame;
    second means for finding a subset of the extended frame which
    matches the second digital frame relatively well, and defining the
    blend point as a sample in the subset.

12. The apparatus of claim 11, wherein the extended frame comprises a
concatenation of the first digital frame with a replica of the first
digital frame.

13. The apparatus of claim 11, wherein the subset of the extended frame
which matches the second digital frame relatively well comprises a
subset with a minimum average magnitude difference over the samples in
the subset, and the blend point comprises a first sample in the subset.

14. The apparatus of claim 10, wherein the means for determining
includes:
    first means for computing an extended frame comprising a
    discontinuity-smoothed concatenation of the first digital frame with
    a replica of the first digital frame;
    second means for finding a subset of the extended frame with a
    minimum average magnitude difference between the samples in the
    subset and the second digital frame, and defining the blend point as
    a first sample in the subset.

15. The apparatus of claim 10, wherein the blending means includes:
    means for supplying a first set of samples derived from the first
    digital frame and the blend point as a first segment of the digital
    sequence; and
    means for combining the second digital frame with a second set of
    samples derived from the first digital frame and the blend point,
    with emphasis on the second set in a starting sample and emphasis on
    the second digital frame in an ending sample to produce a second
    segment of the digital sequence.

16. The apparatus of claim 10, wherein the means for determining
includes:
    first means for computing an extended frame comprising a
    discontinuity-smoothed concatenation of the first digital frame with
    a replica of the first digital frame;
    second means for finding a subset of the extended frame with a
    minimum average magnitude difference between the samples in the
    subset and the second digital frame, and defining the blend point as
    a first sample in the subset; and
wherein the blending means includes:
    means for supplying a first set of samples derived from the first
    digital frame and the blend point as a first segment of the digital
    sequence; and
    means for combining the second digital frame with the subset of the
    extended frame, with emphasis on the subset of the extended frame in
    a starting sample and emphasis on the second digital frame in an
    ending sample to produce a second segment of the digital sequence.

17. The apparatus of claim 16, wherein the first and second digital
frames represent endings and beginnings respectively of adjacent
diphones in speech, and the transducer means produces synthesized
speech.

18. An apparatus for synthesizing speech in response to a text,
comprising:
    means for translating text to a sequence of sound segment codes;
    means, responsive to sound segment codes in the sequence, for
    decoding the sequence of sound segment codes to produce strings of
    digital frames of a plurality of samples representing sounds for
    respective sound segment codes in the sequence, wherein the
    identified strings of digital frames have beginnings and endings;
    means for concatenating a first digital frame at the ending of an
    identified string of digital frames of a particular sound segment
    code in the sequence with a second digital frame at the beginning of
    an identified string of digital frames of an adjacent sound segment
    code in the sequence to produce a speech data sequence, including
        a buffer store to store the samples of first and second digital
        frames;
        means, coupled to the buffer store, for determining a blend
        point for the first and second digital frames in response to
        magnitudes of samples in the first and second digital frames;
        and
        blending means, coupled with the buffer store and the means for
        determining, for computing a digital sequence representing a
        concatenation of the first and second sound segments in response
        to the first frame, the second frame and the blend point; and
    an audio transducer, coupled to the means for concatenating, to
    generate synthesized speech in response to the speech data sequence.

19. The apparatus of claim 18, further including:
    means, responsive to the sound segment codes for adjusting pitch
    and duration of the identified strings of digital frames in the
    speech data sequence.

20. The apparatus of claim 18, wherein the means for determining
includes:
    first means for computing an extended frame in response to the
    first digital frame;
    second means for finding a subset of the extended frame which
    matches the second digital frame relatively well, and defining the
    blend point as a sample in the subset.

21. The apparatus of claim 20, wherein the extended frame comprises a
concatenation of the first digital frame with a replica of the first
digital frame.

22. The apparatus of claim 20, wherein the subset of the extended frame
which matches the second digital frame relatively well comprises a
subset with a minimum average magnitude difference over the samples in
the subset, and the blend point comprises a first sample in the subset.

23. The apparatus of claim 18, wherein the means for determining
includes:
    first means for computing an extended frame comprising a
    discontinuity-smoothed concatenation of the first digital frame with
    a replica of the first digital frame;
    second means for finding a subset of the extended frame with a
    minimum average magnitude difference between the samples in the
    subset and the second digital frame, and defining the blend point as
    a first sample in the subset.

24. The apparatus of claim 18, wherein the blending means includes:
    means for supplying a first set of samples derived from the first
    digital frame and the blend point as a first segment of the digital
    sequence; and
    means for combining the second digital frame with a second set of
    samples derived from the first digital frame and the blend point,
    with emphasis on the second set in a starting sample and emphasis on
    the second digital frame in an ending sample to produce a second
    segment of the digital sequence.

25. The apparatus of claim 18, wherein the means for determining
includes:
    first means for computing an extended frame comprising a
    discontinuity-smoothed concatenation of the first digital frame with
    a replica of the first digital frame;
    second means for finding a subset of the extended frame with a
    minimum average magnitude difference between the samples in the
    subset and the second digital frame, and defining the blend point as
    a first sample in the subset; and
wherein the blending means includes:
    means for supplying a first set of samples derived from the first
    digital frame and the blend point as a first segment of the digital
    sequence; and
    means for combining the second digital frame with the subset of the
    extended frame, with emphasis on the subset of the extended frame in
    a starting sample and emphasis on the second digital frame in an
    ending sample to produce a second segment of the digital sequence.

26. The apparatus of claim 18, wherein the sound segment codes
represent speech diphones, and the first and second digital frames
represent endings and beginnings respectively of adjacent diphones in
speech.



[FIG. 1 (sheet 1/17): block diagram. Recoverable labels: DISPLAY; AUDIO
OUT; ENCODED VOICE TABLES; TEXT-TO-SPEECH CODE; BUFFERS; OTHER HOST
MEMORY; TTS DICTIONARY; DIPHONE TABLE; NOISE SHAPED VECTOR QUANTIZATION
TABLE FOR ENCODING; NOISE COMPENSATED VECTOR TABLE FOR DECODING; OPTIMUM
BLEND POINT DIPHONE CONCATENATOR. Reference numerals 15, 16 and 17
appear in the drawing.]



[FIG. 2 (sheet 2/17): flowchart titled TEXT-TO-SPEECH CODE. Recoverable
steps: receive input text; translate to diphone strings; decompress
diphone strings to generate VQ data frames; blend diphone VQ data
frames; adjust duration of diphone VQ data frames; adjust pitch of
diphone VQ data frames; supply speech data to audio output; generate
intonation control data. Reference numerals 20 and 22 appear in the
drawing.]



[FIG. 3, sheet 3/17. Layout of a diphone record. Reference numerals: 30, 31, 32, 33.

Diphone Record fields:
    Left Diphone                      Right Diphone
    Left Pitch Period Count           Right Pitch Period Count
    Pointer to Left Pitch Period      Pointer to Right Pitch Period
    Pointer to Left Demi Data         Pointer to Right Demi Data

Referenced tables:
    Pitch Table (left), entries up to LP_NL-1
    VQ Compressed Speech Records (left), LFRAME_0 ... LFRAME_ML-1
    VQ Compressed Speech Records (right), RFRAME_0 ... RFRAME_MR-1
    Pitch Table (right), entries up to RP_NR-1]



[FIG. 4, sheet 4/17. Encoder flowchart. Steps: offset compensation; linear predictive filtering; estimation of pitch filter parameters and quantization (Popt); pitch filter; block gain estimation; residual coding using full search VQ coder; store VQ string, G, Popt; inverse pitch filter; pitch buffer update (PBUF); inverse linear predictive filtering. Reference numerals: 51 to 56, 58 to 60.]



[Sheet 5/17. Pitch buffer diagram with labels PBUF, y_n, N-1 and reference numerals 100, 101.

FIG. 6. Flowchart. Steps: apply noise filter and scale; find pointer to best match in vector quantization table; access quantization vector using pointer; use for next frame filter and PBUF updates. Reference numerals: 123, 125.]



[FIG. 7, sheet 6/17. FRAME DECODER (200). Steps: decode parameters (G, Popt, VQ string); access and concatenate quantization vectors for VQ string, from a table with entries QV_0, QV_1, QV_2, ..., QV_255; decode residual signal r'; inverse pitch filter (y'_n); synthesis pitch buffer update (SPBUF); inverse linear predictive filtering. Reference numerals: 125, 200, 202 to 205.]
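
The table lookup named in FIGS. 6 and 7 (find a pointer to the best match in the vector quantization table when encoding, then access and concatenate the quantization vectors when decoding) can be sketched as follows. This is a simplified illustration under stated assumptions: the distortion measure is plain squared error, the noise filtering and scaling step is omitted, the frame length is a multiple of the vector dimension, and the names `best_match`, `encode_frame` and `decode_vq_string` are hypothetical. With 256 table entries (QV_0 to QV_255), each pointer fits in one byte.

```python
def best_match(table, vector):
    """Full search: index of the table entry closest to `vector`
    (squared error is an assumed distortion measure)."""
    def distortion(k):
        return sum((a - b) ** 2 for a, b in zip(table[k], vector))
    return min(range(len(table)), key=distortion)


def encode_frame(table, frame):
    """Split `frame` into vectors of the table's dimension and return
    the list of pointers (the VQ string)."""
    dim = len(table[0])
    if len(frame) % dim:
        raise ValueError("frame length must be a multiple of the vector dimension")
    return [best_match(table, frame[i:i + dim]) for i in range(0, len(frame), dim)]


def decode_vq_string(table, vq_string):
    """Access each quantization vector by pointer and concatenate them."""
    out = []
    for pointer in vq_string:
        out.extend(table[pointer])
    return out
```

Encoding is the expensive direction because every table entry is compared with every input vector; decoding is a sequence of table reads.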



[FIG. 8, sheet 7/17. Flowchart of diphone blending. Steps: receive left and right diphone; store last frame of left diphone in buffer L_n; store first frame of right diphone in buffer R_n; replicate and concatenate L_n; smooth discontinuity; find optimum match of R_n to El_n (P_opt); blend El_n and R_n with P_opt. Reference numerals: 300 to 306.]

[FIG. 9.]



[FIG. 10, sheet 9/17.]

NOTES:

T = desired duration of a phoneme
f_b = desired beginning pitch in Hz
f_e = desired ending pitch in Hz

P1, P2, ..., P6 are the desired pitch periods, in number of samples, corresponding to the frequencies f1, f2, ..., f6.

Relationship between Pi and fi:

Pi = Fs/fi, where Fs is the sampling frequency.
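
The relationship Pi = Fs/fi in the notes to FIG. 10 converts a target pitch in Hz to a pitch period in samples. The sketch below works through that conversion. The sampling frequency of 8000 Hz, the rounding to whole samples, and the linear spacing of the six target frequencies between the beginning and ending pitch are assumptions chosen for illustration; the notes do not state how f1 to f6 are derived.

```python
def pitch_period_samples(fs_hz, f_hz):
    """Pi = Fs / fi, rounded to a whole number of samples (rounding assumed)."""
    if f_hz <= 0:
        raise ValueError("pitch frequency must be positive")
    return round(fs_hz / f_hz)


def target_periods(fs_hz, f_begin_hz, f_end_hz, count=6):
    """Pitch periods P1..Pcount for frequencies spaced linearly (an
    assumption) from the beginning pitch to the ending pitch."""
    if count < 2:
        raise ValueError("need at least two target frequencies")
    step = (f_end_hz - f_begin_hz) / (count - 1)
    return [pitch_period_samples(fs_hz, f_begin_hz + k * step) for k in range(count)]
```

For example, at Fs = 8000 Hz a pitch of 100 Hz corresponds to a period of 80 samples and a pitch of 200 Hz to 40 samples, so a rising pitch contour calls for progressively shorter periods.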



[FIG. 11, sheet 10/17. Flowchart: increase pitch period to N + Δ. Steps: store pitch period data in buffer x_n; generate left vector L_n = WL(x_n, Δ, N) (beginning most significant); generate right vector R_n = WR(x_n, N-Δ, N) (ending most significant); blend L_n and R_n-Δ. Reference numerals: 350 to 354.]



[FIG. 12, sheet 11/17.]



[FIG. 13, sheet 12/17. Flowchart: decrease pitch period to N - Δ. Steps: store two pitch periods in buffer; generate left vector L_n (beginning most significant); generate right vector R_n (ending most significant); blend L_n and R_n+Δ. Reference numerals: 400 to 404.]



[FIG. 14, sheet 13/17.]



[FIG. 15, sheet 14/17. Flowchart: insert pitch period between L_n and R_n. Steps: store L_n and R_n in buffer; generate left vector WL(L_n) (ending most significant); generate right vector WR(R_n) (beginning most significant); blend WL(L_n) and WR(R_n) to inserted period x_n. Reference numerals: 450 to 455.]



[FIG. 16, sheet 15/17. Waveform diagram. Legible labels: N; Weighting Function.]



[FIG. 17, sheet 16/17. Flowchart: insert pitch period R_n which follows L_n. Steps: store L_n and R_n in buffer; generate left vector WL(L_n) (beginning most significant); generate right vector WR(R_n) (ending most significant); blend WL(L_n) and WR(R_n) to create resulting L'_n; replace L_n, R_n with L'_n in pitch period string. Reference numerals: 500 to 505.]



[FIG. 18, sheet 17/17. Waveform diagram. Legible labels: N; Weighting Function; WL(L_n); WR(R_n); sum of the weighted L_n and R_n.]
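
The steps legible in FIG. 17 (weight L_n toward its beginning, weight R_n toward its ending, sum the two, and put the result L'_n in place of the pair) can be sketched as follows. The drawings show the weighting functions only graphically, so the linear ramps used here are an assumption, as are the function names and the requirement that both periods have the same length N. Replacing two periods with one blended period shortens the pitch period string by one period.

```python
def merge_periods(left, right):
    """Blend two equal-length pitch periods into one.

    `left` is weighted with a falling ramp (beginning most significant),
    `right` with a rising ramp (ending most significant). Linear ramps
    are an assumed weighting function.
    """
    n = len(left)
    if n != len(right) or n < 2:
        raise ValueError("periods must have the same length, at least 2")
    merged = []
    for i in range(n):
        w = i / (n - 1)            # 0 at the first sample, 1 at the last
        merged.append((1.0 - w) * left[i] + w * right[i])
    return merged


def replace_pair(periods, k):
    """Return a new pitch period string in which periods[k] and
    periods[k + 1] are replaced by their blend."""
    if not 0 <= k < len(periods) - 1:
        raise IndexError("k must index a period that has a successor")
    blended = merge_periods(periods[k], periods[k + 1])
    return periods[:k] + [blended] + periods[k + 2:]
```

Because the blended period starts like L_n and ends like R_n, it joins the neighbouring periods on both sides without introducing a new discontinuity at either boundary.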



INTERNATIONAL SEARCH REPORT

International Application No. PCT/US 94/00770

A. CLASSIFICATION OF SUBJECT MATTER

G 10 L 3/00, G 10 L 5/02, G 10 L 7/02, G 10 L 9/14

According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED

Minimum documentation searched (classification system followed by classification symbols)

G 10 L 3/00, G 10 L 5/00, G 10 L 7/00, G 10 L 9/00

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practical, search terms used)

Category  Citation of document, with indication, where appropriate,   Relevant to
          of the relevant passages                                     claim No.

A         EP, A1, 0 515 709 (IBM) 02 December 1992 (02.12.92),         1, 10, 18
          fig. 1; abstract; claims 1,6,7.

A         EP, A1, 0 232 456 (AMERICAN TELEPHONE AND TELEGRAPH COMP.)   1, 10, 18
          19 August 1987 (19.08.87), figs. 1-4; abstract; claim 1.

A         EP, A1, 0 140 777 (TEXAS INSTRUMENTS FRANCE)                 1, 10, 18
          08 May 1985 (08.05.85), figs. 1,3; abstract; claims 1-9.

□ Further documents are listed in the continuation of box C.

□ Patent family members are listed in annex.



* Special categories of cited documents:

"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier document but published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date claimed
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art
"&" document member of the same patent family



Date of the actual completion of the international search 

18 May 1994 


Date of mailing of the international search report 

3- (fc94 


Name and mailing address of the ISA 

European Patent Office, P.B. 5818 Patentlaan 2 
NL - 2280 HV Rijswijk 
Tel. ( + 31-70) 340-2040, Tx. 31 651 epo nl. 
Fax ( + 31-70) 340-3016 


Authorized officer 

BERGER e.h. 



Form PCT/ISA/210 (second sheet) (July 1992)



ANNEX

to the International Search Report to the International Application No. PCT/US 94/00770 SAE 85896

This Annex lists the patent family members relating to the patent documents cited in the above-mentioned international search report. The Office is in no way liable for these particulars which are given merely for the purpose of information.



Patent document cited    Publication    Patent family        Publication
in search report         date           member(s)            date

EP A1 515709             02-12-92       JP A2 5197398        06-08-93

EP A1 232456             19-08-87       CA A1 1318976        08-06-93
                                        DE CO 3685324        17-06-92
                                        EP B1 232456         13-05-92
                                        JP A2 62159199       15-07-87
                                        US A 4827517         02-05-89
                                        US E 34247           11-05-93

EP A1 140777             08-05-85       DE CO 3480969        08-02-90
                                        EP B1 140777         03-01-90
                                        FR A1 2553555        19-04-83
                                        FR B1 2553555        11-04-86
                                        JP A2 66102697       06-06-85
                                        US A 4912768         27-03-90

